This should be pretty straightforward to do in R using dplyr, but I am a bit stuck as to how exactly to do this.
I have aggregated a set of transaction revenues by day, and I want to calculate the daily balance working back from a final balance. In Excel this would be a trivial formula: input the first balance, then for each subsequent row subtract the previous row's daily revenue from the previous row's balance.
I am trying to do this in dplyr and keep hitting a wall. Any advice as to how I could achieve this would be great. I'm pretty sure you need to use lag() but I can't seem to figure out exactly how.
Sample data:
library(tidyverse)
x <- tibble(date = c('2018-04-03','2018-04-02','2018-04-01','2018-03-31','2018-03-30'),
daily_spend = c(575,-5.26,-112.45,-73.5,25.6))
final_balance <- 1000
Here's what the data looks like:
# A tibble: 5 x 2
date daily_spend
<chr> <dbl>
1 2018-04-03 575.
2 2018-04-02 -5.26
3 2018-04-01 -112.
4 2018-03-31 -73.5
5 2018-03-30 25.6
What I would like to do would be to add an additional column 'balance' and for each row have the value be the previous row's balance minus the previous row's daily spend, giving a running daily balance.
Here are some expected values:
# A tibble: 5 x 3
date daily_spend end_balance
<chr> <dbl> <dbl>
1 2018-04-03 575. 1000
2 2018-04-02 -5.26 425
3 2018-04-01 -112. 430.
4 2018-03-31 -73.5 542.71
5 2018-03-30 25.6 616.21
Here's what I have been trying, which doesn't work beyond the first two rows (lag() operates on the whole column at once, so the updates don't cascade row by row):
x <- x %>%
mutate(end_balance = ifelse(row_number() ==1,final_balance,0),
end_balance = ifelse(row_number()>1,lag(end_balance)-lag(daily_spend),end_balance))
The results of this method:
# A tibble: 5 x 3
date daily_spend end_balance
<chr> <dbl> <dbl>
1 2018-04-03 575. 1000.
2 2018-04-02 -5.26 425.
3 2018-04-01 -112. 5.26
4 2018-03-31 -73.5 112.
5 2018-03-30 25.6 73.5
Here you go: subtract the cumulative spend from final_balance, then add back the current row's spend so each row reflects the balance before that day's spend is applied:
mutate(x, end_balance = final_balance - cumsum(daily_spend) + daily_spend)
Subtract the lagged cumulative sum of daily_spend from final_balance:
x %>%
mutate(end_balance = final_balance - cumsum(lag(daily_spend, default = 0))) %>%
as.data.frame()
# date daily_spend end_balance
#1 2018-04-03 575.00 1000.00
#2 2018-04-02 -5.26 425.00
#3 2018-04-01 -112.45 430.26
#4 2018-03-31 -73.50 542.71
#5 2018-03-30 25.60 616.21
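Equivalently, the Excel-style recurrence (each balance is the previous balance minus the previous row's spend) can be written out explicitly with purrr::accumulate. A minimal sketch using the sample data above:

```r
library(tibble)
library(dplyr)
library(purrr)

x <- tibble(date = c('2018-04-03','2018-04-02','2018-04-01','2018-03-31','2018-03-30'),
            daily_spend = c(575, -5.26, -112.45, -73.5, 25.6))
final_balance <- 1000

# Seed with final_balance, then repeatedly subtract the previous row's spend.
# head(daily_spend, -1) drops the last spend, since .init supplies the first value.
x %>%
  mutate(end_balance = accumulate(head(daily_spend, -1), `-`, .init = final_balance))
```

This produces the same column as the cumsum() answers; accumulate() is just the explicit form of the running subtraction, which can be handy when the recurrence is more complicated than a plain cumulative sum.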
I have a data frame with daily observations from several years. Some days are missing from the dataset:
df <- tibble(time = seq(as.Date("2010/1/1"), as.Date("2020/12/31"), "days"),
value = runif(4018))
# reproducing missing days
df <- df[-sample.int(nrow(df), 100),]
I am trying to use dplyr::group_by to group my data frame using the same date range across years. However, the range starts in one year and ends in the next, e.g. between November 15th and February 15th for the whole time series. I would like one group per date range, e.g. one group for 2010-11-15 to 2011-02-15, another group for 2011-11-15 to 2012-02-15, and so on.
Any tips?
One approach is to create a separate data.frame that transparently shows the groups assigned and date ranges. Then, you can use the data.frame with fuzzy_inner_join to assign rows to groups, allowing you to use group_by with these group numbers. Alternatives to consider would be using data.table, cut, and/or findInterval. Let me know if this will address your needs.
library(lubridate)
library(tidyverse)
library(fuzzyjoin)
df_group <- data.frame(
group = seq.int(max(year(df$time)) - min(year(df$time)) + 1),
start = seq.Date(as.Date(paste0(min(year(df$time)), "-11-15")), as.Date(paste0(max(year(df$time)), "-11-15")), "years"),
end = seq.Date(as.Date(paste0(min(year(df$time)) + 1, "-02-15")), as.Date(paste0(max(year(df$time)) + 1, "-02-15")), "years")
)
fuzzy_inner_join(
df,
df_group,
by = c("time" = "start", "time" = "end"),
match_fun = list(`>=`, `<=`)
)
Output
time value group start end
<date> <dbl> <int> <date> <date>
1 2010-11-15 0.901 1 2010-11-15 2011-02-15
2 2010-11-16 0.991 1 2010-11-15 2011-02-15
3 2010-11-17 0.430 1 2010-11-15 2011-02-15
4 2010-11-18 0.394 1 2010-11-15 2011-02-15
5 2010-11-19 0.142 1 2010-11-15 2011-02-15
6 2010-11-20 0.280 1 2010-11-15 2011-02-15
7 2010-11-21 0.565 1 2010-11-15 2011-02-15
8 2010-11-22 0.935 1 2010-11-15 2011-02-15
9 2010-11-23 0.358 1 2010-11-15 2011-02-15
10 2010-11-24 0.842 1 2010-11-15 2011-02-15
# … with 941 more rows
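For the findInterval alternative mentioned above, here is a sketch (the `assign_group` helper is hypothetical, and it assumes the ranges are sorted by start and non-overlapping, as they are in df_group):

```r
# Sketch: assign each date to the group whose [start, end] range contains it,
# using base R findInterval instead of fuzzyjoin.
assign_group <- function(time, starts, ends) {
  g <- findInterval(time, starts)            # index of the last start <= time
  # keep g only if time also falls on or before that range's end; else NA
  ifelse(g >= 1 & time <= ends[pmax(g, 1)], g, NA_integer_)
}

# With df and df_group as above, you would call:
# df$group <- assign_group(df$time, df_group$start, df_group$end)
assign_group(as.Date(c("2010-12-01", "2011-06-01")),
             starts = as.Date(c("2010-11-15", "2011-11-15")),
             ends   = as.Date(c("2011-02-15", "2012-02-15")))
```

Rows outside every November-to-February window get NA and can be filtered out before group_by.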
I am trying to compute the hedging error for an options pricing model. Each day, I will compute an equivalent position that one should take when hedging against this option in the market, let's call it X_s, and compute the cash position of the hedge, let's call it X_0, for every given day. This doesn't present any issues since I can mapply() a function that calculates all the necessary partials given my parameters, stock price, etc. to compute X_s and X_0. Where I am starting to run into issues is when trying to compute the hedging error for my models. Here's a subset of my data that I'm looking at:
date optionid px_last r X_s_position X_0_cash mp_ba
1 2020-03-03 127117475 3003.37 0.011587702 0.642588548 -1783.881169 146.05
2 2020-03-03 131373646 3003.37 0.011587702 0.527107056 -1477.947518 105.15
3 2020-03-06 127117475 2972.37 0.008128021 0.566540143 -1558.566925 125.40
4 2020-03-09 127117475 2746.56 0.004745339 0.133284145 -332.122900 33.95
5 2020-03-10 127117475 2882.23 0.005884274 0.413389283 -1125.632994 65.85
6 2020-03-11 127117475 2741.38 0.006223502 0.131700734 -333.691757 27.35
7 2020-03-12 127117475 2480.64 0.003787032 0.003680431 -8.179825 0.95
So, let's say we're looking at optionid == 127117475. On the first observation date we won't have any hedge error, so we go to the next observation on 2020-03-06. The hedge error on that day would be
0.642588548*2972.37 + -1783.881169*exp(0.011587702*as.numeric(as.Date("2020-03-06") - as.Date("2020-03-03"))/365) - 105.15
So in row 3, in the new 'hedge error' column I want to create, the value would be 20.80985. To calculate the hedge error for each subsequent observation of optionid == 127117475, I take the previous observation's X_s_position and multiply it by the next spot price (px_last), add the previous X_0_cash value multiplied by exp(r*(difference in days between the two observations)/365), and then subtract the next observation of the option price (mp_ba).
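That worked example can be checked directly in R (using explicit as.Date calls for the day difference):

```r
# Verifying the row-3 hedge error from the numbers in the question
dt <- as.numeric(as.Date("2020-03-06") - as.Date("2020-03-03"))
0.642588548*2972.37 + -1783.881169*exp(0.011587702*dt/365) - 105.15
#> 20.80985
```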
Perhaps like so? Should the mp_ba in your example be 125.40?
library(dplyr)
df %>%
group_by(optionid) %>%
mutate(hedge_error = lag(X_s_position)*px_last + lag(X_0_cash)*exp(lag(r)*as.numeric(date - lag(date))/365) - mp_ba)
Result
# A tibble: 7 × 8
# Groups: optionid [2]
  date       optionid px_last r       X_s_position X_0_cash mp_ba hedge_error
  <date>        <int>   <dbl> <dbl>          <dbl>    <dbl> <dbl>       <dbl>
1 2020-03-03 127117475 3003. 0.0116       0.643     -1784. 146.       NA
2 2020-03-03 131373646 3003. 0.0116       0.527     -1478. 105.       NA
3 2020-03-06 127117475 2972. 0.00813      0.567     -1559. 125.        0.560
4 2020-03-09 127117475 2747. 0.00475      0.133      -332.  34.0      -36.6
5 2020-03-10 127117475 2882. 0.00588      0.413     -1126.  65.8      -13.8
6 2020-03-11 127117475 2741. 0.00622      0.132      -334.  27.4      -19.7
7 2020-03-12 127117475 2481. 0.00379      0.00368    -8.18   0.95      -7.95
I have a dataset with dates in tibble format from tidyverse/dplyr.
library(tidyverse)
A = seq(from = as.Date("2019/1/1"),to=as.Date("2022/1/1"), length.out = 252*3)
length(A)
x = rnorm(252*3)
d = tibble(A,x);d
Resulting in:
# A tibble: 756 x 2
A x
<date> <dbl>
1 2019-01-01 1.43
2 2019-01-02 0.899
3 2019-01-03 0.658
4 2019-01-05 -0.0720
5 2019-01-06 -1.99
6 2019-01-08 -0.743
7 2019-01-09 0.426
8 2019-01-11 0.00675
9 2019-01-12 0.967
10 2019-01-14 -0.606
# ... with 746 more rows
I also have a date of interest, say:
start = as.Date("2021/12/15");start
I want to subset the dataset from this specific date (start) going one year back. But a trading year has 252 observations, not 365 days.
I tried:
d %>%
  dplyr::filter(A < start) %>%
  dplyr::slice_tail(n = 252)
but I don't like it because my real dataset has more than one factor label, and this returns 252 observations in total rather than 252 per factor.
I also tried:
LAST_YEAR = start - 365
d %>%
  dplyr::filter(A <= start & A >= LAST_YEAR)
which works, but I want to use the 252-observation convention. Imagine that I want to go two years (252*2) back and find how many observations I have in that specific time interval.
Any help with how I can do that?
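One way to keep the 252-observation convention per factor is to group before slicing, so slice_tail takes the last 252 rows within each group. A sketch, assuming the real data has a grouping column, here called `ticker` (a hypothetical name), alongside the date column A:

```r
library(dplyr)
library(tibble)

# Hypothetical data in the same shape as d, plus a grouping column `ticker`
d2 <- tibble(
  ticker = rep(c("T1", "T2"), each = 252 * 3),
  A = rep(seq(as.Date("2019/1/1"), as.Date("2022/1/1"), length.out = 252 * 3), 2),
  x = rnorm(252 * 3 * 2)
)
start <- as.Date("2021/12/15")
n_obs <- 252   # one trading year; use 252 * 2 for two years back

res <- d2 %>%
  group_by(ticker) %>%        # slice_tail now operates within each group
  filter(A < start) %>%
  slice_tail(n = n_obs) %>%
  ungroup()
```

Because slice_tail respects groups, this returns 252 observations per factor level rather than 252 in total.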
I have a dataframe with 3 columns : Dates, Tickers (i.e. financial instruments) and Prices.
I just want to calculate the returns for each ticker.
Some data to play with:
AsofDate = as.Date(c("2018-01-01","2018-01-02","2018-01-03","2018-01-04","2018-01-05",
"2018-01-01","2018-01-02","2018-01-03","2018-01-04","2018-01-05",
"2018-01-01","2018-01-02","2018-01-03","2018-01-04","2018-01-05"))
Tickers = c("Ticker1", "Ticker1", "Ticker1", "Ticker1", "Ticker1",
"Ticker2", "Ticker2", "Ticker2", "Ticker2", "Ticker2",
"Ticker3", "Ticker3", "Ticker3", "Ticker3", "Ticker3")
Prices =c(1,2,7,4,2,
6,5,7,9,12,
11,11,16,14,15)
df = data.frame(AsofDate, Tickers, Prices)
My first idea was just to order the prices by (Tickers, Prices), run the calculation over the whole vector, and set the first day of each ticker to NA...
TTR::ROC(x=Prices)
That works in Excel, but I want something more elegant in R.
So I tried something like this:
require(dplyr)
ret = df %>%
select(Tickers,Prices) %>%
group_by(Tickers) %>%
do(data.frame(LogReturns=TTR::ROC(x=Prices)))
df$LogReturns = ret$LogReturns
But here I get too many values; it seems that the calculation is not done by ticker.
Can you give me a hint ?
Thanks !!
In dplyr, we can use lag to get previous Prices
library(dplyr)
df %>%
group_by(Tickers) %>%
mutate(returns = (Prices - lag(Prices))/lag(Prices))
# AsofDate Tickers Prices returns
# <date> <fct> <dbl> <dbl>
# 1 2018-01-01 Ticker1 1 NA
# 2 2018-01-02 Ticker1 2 1
# 3 2018-01-03 Ticker1 7 2.5
# 4 2018-01-04 Ticker1 4 -0.429
# 5 2018-01-05 Ticker1 2 -0.5
# 6 2018-01-01 Ticker2 6 NA
# 7 2018-01-02 Ticker2 5 -0.167
# 8 2018-01-03 Ticker2 7 0.4
# 9 2018-01-04 Ticker2 9 0.286
#10 2018-01-05 Ticker2 12 0.333
#11 2018-01-01 Ticker3 11 NA
#12 2018-01-02 Ticker3 11 0
#13 2018-01-03 Ticker3 16 0.455
#14 2018-01-04 Ticker3 14 -0.125
#15 2018-01-05 Ticker3 15 0.0714
In base R, we can use ave with diff, dividing by the previous prices:
df$returns <- with(df, ave(Prices, Tickers, FUN = function(x) c(NA, diff(x)/head(x, -1))))
We can use data.table
library(data.table)
setDT(df)[, returns := (Prices - shift(Prices))/shift(Prices), by = Tickers]
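Note that TTR::ROC computes log returns by default, so if log returns are what's wanted, the grouped equivalent would be (a sketch using the sample data from the question):

```r
library(dplyr)

df <- data.frame(
  AsofDate = as.Date(rep(c("2018-01-01","2018-01-02","2018-01-03","2018-01-04","2018-01-05"), 3)),
  Tickers = rep(c("Ticker1","Ticker2","Ticker3"), each = 5),
  Prices = c(1, 2, 7, 4, 2, 6, 5, 7, 9, 12, 11, 11, 16, 14, 15)
)

# log return: log of the ratio of consecutive prices within each ticker
df %>%
  group_by(Tickers) %>%
  mutate(LogReturns = log(Prices / lag(Prices)))
```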
I am struggling to generate a date sequence between two dates in same column using R script.
I have request id and sequence ID, Date and status.
Input table
My requirement is to generate table like this.
desired output table
Any help in this regard would be appreciated.
Thank you
You can do this with the tidyverse libraries. First convert your date column with dmy from the lubridate package. Then you can use the tidyr functions complete and fill to extend your data frame as shown: complete has the option to fill in the gaps by day, and grouping by ReqID does this for each of your individual identifiers.
library(tidyverse)
library(lubridate)
df <- tibble(ReqID = 100, ID_Seq = 1:3, Created = dmy("01/01/2018","10/01/2018","18/01/2018"), Status = c("Scheduled","In Execution", "Completed"))
df %>%
group_by(ReqID) %>%
complete(Created = seq.Date(min(Created),max(Created), by = "day")) %>%
fill(ReqID,ID_Seq,Status)
# A tibble: 18 x 4
# Created ReqID ID_Seq Status
# <date> <dbl> <int> <chr>
# 1 2018-01-01 100 1 Scheduled
# 2 2018-01-02 100 1 Scheduled
# 3 2018-01-03 100 1 Scheduled
# 4 2018-01-04 100 1 Scheduled
# 5 2018-01-05 100 1 Scheduled
# 6 2018-01-06 100 1 Scheduled
# 7 2018-01-07 100 1 Scheduled
# 8 2018-01-08 100 1 Scheduled
# 9 2018-01-09 100 1 Scheduled
#10 2018-01-10 100 2 In Execution
#11 2018-01-11 100 2 In Execution
#12 2018-01-12 100 2 In Execution
#13 2018-01-13 100 2 In Execution
#14 2018-01-14 100 2 In Execution
#15 2018-01-15 100 2 In Execution
#16 2018-01-16 100 2 In Execution
#17 2018-01-17 100 2 In Execution
#18 2018-01-18 100 3 Completed
Thank you Jasbner! I have installed dplyr and tidyr packages as suggested.
I am using 'mutate' to fix the date format.
my csv file (file.csv) holds these data lines
ReqID Seq Created Status
100 1 01/01/2018 Scheduled
100 2 10/01/2018 Execution
100 3 15/01/2018 Hold
100 4 18/01/2018 Complete
101 1 10/01/2018 Scheduled
101 2 18/01/2018 Execution
101 3 20/01/2018 Complete
102 1 18/01/2018 Scheduled
102 2 22/01/2018 Execution
102 3 25/01/2018 Cancelled
103 1 01/02/2018 Scheduled
# my final R script
library(dplyr)
library(tidyr)
library(lubridate)
mydata <- read.csv('file.csv') # reading data from csv (read.csv already returns a data frame)
myoutdf <- mydata %>%
  mutate(Created = dmy(Created)) %>%
  group_by(ReqID) %>%
  complete(Created = seq.Date(min(Created), max(Created), by = "day")) %>%
  fill(ReqID, Seq, Status)
print(myoutdf, n = 38) #print all 38 lines