How to create a summation function with data frame in R?

Just for fun, I am trying to create a basic savings calculator. My current code is:
value <- function(years,apr,initial,investment) {
df <- as.data.frame(matrix(nrow = years, ncol = 2))
colnames(df) <- c("year","value")
df$years <- c(1:years)
for (i in 1:years) {
current_value <-(last_value+investment)*apr
}
#repeating calculation for the data frame
print(df)
What I am trying to do is have the calculator create a table that displays the value each year. I've adapted my code from an old homework assignment, so I am not concerned with how to make the data frame. However, I do not know how to make the formula for the summation.
I am trying to model
Current Value = (Cumulative Value + Investment)*(Annual Percentage Rate)
As an example, let's say initial value is 10, investment is 10, and the APR is 1.05
(10+10)*(1.05)=21
(21+10)*(1.05)=32.55
(32.55+10)*(1.05)=44.68
and so on.
Year is there to number the rows accordingly.

We can use Reduce with accumulate = TRUE
calc_fun <- function(years,apr,initial,investment) {
value <- Reduce(function(x, y) (x + investment) * y, rep(apr, years), initial,
accumulate = TRUE)
data.frame(year = 0:years, value)
}
calc_fun(3, 1.05, 10, 10)
# year value
#1 0 10.0000
#2 1 21.0000
#3 2 32.5500
#4 3 44.6775
Using for loop we can do
calc_fun1 <- function(years,apr,initial,investment) {
value <- numeric(years + 1)
value[1] <- initial
for (i in 1:years) value[i + 1] <- (value[i] + investment) * apr
data.frame(year = 0:years, value)
}
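The same recurrence also has a closed form, which avoids iterating at all. The sketch below is not part of the answers above: the name calc_closed is made up for illustration, and it assumes apr is not exactly 1 (otherwise the geometric-series term divides by zero). Since value[n+1] = (value[n] + investment) * apr, unrolling the recurrence gives value[n] = initial * apr^n + investment * apr * (apr^n - 1) / (apr - 1).

```r
# Closed-form version of the savings recurrence (hypothetical helper name).
# Assumption: apr != 1, so the geometric-series denominator is non-zero.
calc_closed <- function(years, apr, initial, investment) {
  n <- 0:years
  value <- initial * apr^n + investment * apr * (apr^n - 1) / (apr - 1)
  data.frame(year = n, value = value)
}

res <- calc_closed(3, 1.05, 10, 10)
res$value
```

For the example inputs this should reproduce the values 10, 21, 32.55 and 44.6775 from the worked example in the question.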

Related

Function in R that performs multiple operations over columns of two datasets

I have two datasets, each with 5 columns and 10,000 rows. I want to calculate y from the values in matching columns of the two datasets: column 1 in dataset 1 with column 1 in dataset 2, then column 2 in dataset 1 with column 2 in dataset 2, and so on. However, y has to follow a set of rules before being calculated. What I have so far doesn't work, and I cannot figure out why, or whether there is an easier way to do all of this.
Create data from t-distributions
mx20 <- as.data.frame(replicate(10000, rt(20,19)))
mx20.50 <- as.data.frame(replicate(10000, rt(20,19)+0.5))
Calculates the mean for each simulated sample
m20 <- apply(mx20, FUN=mean, MARGIN=2)
m20.05 <- apply(mx20.50, FUN=mean, MARGIN=2)
Steps 1 and 2 above are repeated for the remaining sample sizes, drawn from the t-distributions rt(30,29); rt(50,49); rt(100,99); and rt(1000,999)
Bind tables (create data.frame) for each t-distribution specification
tbl <- cbind(m20, m30, m50, m100, m1000)
tbl.50 <- cbind(m20.05, m30.05, m50.05, m100.05, m1000.05)
Finally, I want to calculate the y as specified above. But here is where I get totally lost. Please see below my best attempt so far.
y = (mtheo-m0)/(m1-m0), where y = 0 when m1 < m0 and y = y when m1 >= m0. mtheo is a constant (e.g. 0.50), m1 is value in column 1 of tbl and m0 is value in column 1 of tbl.50.
ycalc <- function(mtheo, m1, m0) {
ifelse(m1>=m0) {
y = (mteo-m0)/(m1-m0)
} ifelse(m1<m0) {
y=0
} returnValue(y)
}
You can try this. I used data frames instead of data tables.
This code is more versatile. You can add or remove parameters. Below are the parameters that you can use to create t distributions.
params = data.frame(
n = c(20, 30, 50, 100, 1000),
df = c(19, 29, 49, 99, 999)
)
And here is a loop that creates the values you need for each t distribution. You can ignore this part if you already have those values (or code to create those values).
tbl = data.frame(i = c(1:10000))
tbl.50 = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
mx = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])))
m <- apply(mx, FUN=mean, MARGIN=2)
tbl = cbind(tbl, m)
names(tbl)[ncol(tbl)] = paste("m", params[i, 1], sep="")
mx.50 = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])+.5))
m.50 <- apply(mx.50, FUN=mean, MARGIN=2)
tbl.50 = cbind(tbl.50, m.50)
names(tbl.50)[ncol(tbl.50)] = paste("m", params[i, 1], ".50", sep="")
}
tbl = tbl[-1]
tbl.50 = tbl.50[-1]
And here is the loop that does the calculations. I save them in a data frame (y). Each column in this data frame is the result of your function applied for all rows.
mtheo = .50
y = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
y$dum = 0
idx = which(tbl[, i] >= tbl.50[, i])
y[idx, ]$dum =
(mtheo - tbl.50[idx, i]) /
(tbl[idx, i] - tbl.50[idx, i])
names(y)[ncol(y)] = paste("y", params[i, 1], sep="")
}
y = y[-1]
You could try this, if the first column in tbl is called m1 and the first column in tbl.50 is called m0:
mteo <- 0.5
ycalc <- ifelse(tbl$m1 >= tbl.50$m0, (mteo - tbl.50$m0)/(tbl$m1 - tbl.50$m0),
ifelse(tbl$m1 < tbl.50$m0, 0, "no"))
Using the same column names provided by your code, and transforming your matrices into dataframes:
tbl <- data.frame(tbl)
tbl.50 <- data.frame(tbl.50)
mteo <- 0.5
ycalc <- ifelse(tbl$m20 >= tbl.50$m20.05, (mteo - tbl.50$m20.05)/(tbl$m20 - tbl.50$m20.05),
ifelse(tbl$m20 < tbl.50$m20.05, "0", "no"))
This results in:
head(ycalc)
[1] "9.22491706576716" "0" "0" "0" "0" "1.77027049630147"
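Because ifelse is vectorised, the rule from the question can also be written as a small function that returns numbers rather than strings and works on whole tables at once. This is only a sketch: the toy matrices m1_tbl and m0_tbl are made-up stand-ins for tbl and tbl.50, and the roles of m1 and m0 follow the definition in the question (m1 from tbl, m0 from tbl.50). Note that when m1 equals m0 the division yields Inf or NaN, which the question does not say how to handle.

```r
# Vectorised rule: y = (mtheo - m0) / (m1 - m0) when m1 >= m0, otherwise 0.
ycalc <- function(mtheo, m1, m0) {
  ifelse(m1 >= m0, (mtheo - m0) / (m1 - m0), 0)
}

# Toy stand-ins for tbl and tbl.50 (made-up numbers, two columns each).
m1_tbl <- cbind(a = c(1.00, 0.00), b = c(0.75, 0.10))
m0_tbl <- cbind(a = c(0.25, 0.50), b = c(0.25, 0.60))

# Column 1 is paired with column 1, column 2 with column 2, element by element.
y <- ycalc(0.5, m1_tbl, m0_tbl)
y
```

With the real data, something like ycalc(0.5, as.matrix(tbl), as.matrix(tbl.50)) would apply the rule to all five column pairs in one call, provided both tables have the same shape.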

Adding a column to a data frame by calculating each value to be added

Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it more clearly.
I want to add a column to a data frame of daily stock returns. Let's say the returns are normally distributed, and I would like to add a column that contains the historical value at risk (hist), computed by a function I wrote myself.
The restriction is that for each observation, my function should use that observation together with the previous 249 observations.
So when the next observation is calculated, it should again use only the 249 observations of the days before it; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the Error code that I get and not my new Variable...
Thanks !!
As wibom wrote in the comment nrow(x) does not work for vectors. What you need is length() instead. Also you do not need return() in the last line as R automatically returns the last line of a function if there is no early return() before.
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # x is a vector here, so use length(); nrow() only works for data frames and matrices
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated based on the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
r <-
mydata %>%
filter(ridx <= i, ridx > i - 249) %>%
pull(Returns)
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
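If the window should always contain exactly 250 observations (the current one plus the previous 249), the window bounds can be made explicit. The sketch below is a variation on the answers above, not a replacement for them: the name rolling_var is hypothetical, and it makes the assumption that rows without a full window get NA, whereas the versions above compute a value from however many observations are available.

```r
# Rolling historical VaR over a fixed window of n observations.
# Assumption: positions with fewer than n observations available return NA.
rolling_var <- function(returns, n = 250, hd = 20, q = 0.05) {
  vapply(seq_along(returns), function(i) {
    if (i < n) return(NA_real_)
    window <- returns[(i - n + 1):i]  # current value plus the previous n - 1
    unname(quantile(window, probs = q, na.rm = TRUE)) * sqrt(hd)
  }, numeric(1))
}

set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
df$VaR <- rolling_var(df$Returns)
```

The first 249 rows of df$VaR are NA here, and row 250 is the first one computed from a full window.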

Using nested for loops in the fitness function of a genetic algorithm makes it too slow

I'm trying to use genetic algorithms with the "GA" package, but I've run into a problem writing the fitness function. I'm using GA to simulate my data and find the best-fitting values for the constants in my model.
My data comes from observations of a car's speed and other parameters. Let's say I have a car that made 2 trips, and I want to build a model for it.
Each trip has multiple columns (speed, delta velocity with respect to the opposite car, and range between the two cars). I take the first row of each trip and pass it to the equations in the fitness function, which generate new values for the speed, delta velocity, and range. I then use those new values to generate the next ones, and so on. Finally, I compare the simulated distance with the observed range in my data, and the GA should minimise the difference.
First: here's my data.
https://drive.google.com/open?id=1923Jl6pDnQa_tGAluANUfIWCcyf85YVq
Second: here's my fitness function and the GA
Fitness_Function <- function(data, M_Acc, D_Speed, Beta, Com_Acc, Gap_J, D_Time){
Trips_IDs <- sort(unique(data$FileName))
# Trip=1;ROW=1
Calibrated_DF <- data.frame()
for (Trip in 1:2) {
Trip_Data <- data%>%filter(FileName==Trips_IDs[Trip])
attach(Trip_Data, warn.conflicts=F)
for (ROW in 1:(nrow(Trip_Data)-1)) {
if (ROW==1) {
speed <- Filling_Speed[1]
Delta_V <- Filling_DeltaVelocity[1]
Dist <- Filling_Range[1]
# M_Acc = 0.8418 ;D_Speed =29.2 ;Beta = 3.52
# Com_Acc = 0.8150 ;Gap_J = 1.554 ;D_Time = 0.878
Distance <- speed*D_Time - (speed*Delta_V)/(2*sqrt(M_Acc*Com_Acc))
if (Distance < 0 ) {
Distance <- 0
}
D_Gap <- Gap_J + Distance
Acceleration <- M_Acc*(1-(speed/D_Speed)^Beta-(D_Gap/Dist)^2)
}else{
speed <- speed_C
Delta_V <- Delta_V_C
Dist <- Dist_c
Distance <- speed*D_Time - (speed*Delta_V)/(2*sqrt(M_Acc*Com_Acc))
if (is.na(Distance)) {
Distance <- 0
}
if (Distance < 0 ) {
Distance <- 0
}
D_Gap <- Gap_J + Distance
Acceleration <- M_Acc*(1-(speed/D_Speed)^Beta-(D_Gap/Dist)^2)
}
Lead_Veh_Speed_F <- Filling_Speed[ROW+1]+Filling_DeltaVelocity[ROW+1]
speed_C <- speed + Acceleration*0.1
Delta_V_C <- Lead_Veh_Speed_F-speed_C
Dist_c <- Dist+(Delta_V_C+Delta_V)/2*0.1
Calibrated_DF <- rbind(Calibrated_DF,c(Dist_c,ROW+1,Trips_IDs[Trip],Trip_Data$Filling_Range[ROW+1]))
}
detach(Trip_Data)
}
colnames(Calibrated_DF) <- c("C_Distance","row","Trip","Actual_Distance")
Calibrated_DF$Dif <- (Calibrated_DF$C_Distance-Calibrated_DF$Actual_Distance)^2
RMSPE <- sqrt(sum(Calibrated_DF$Dif)/sum(Calibrated_DF$Actual_Distance^2))
return(RMSPE)
# return(Calibrated_DF)
}
GA_Test <- ga(type='real-valued', lower=c( 0.1 , 1 , 1 , 0.1 , 0.1 , 0.1 ),
upper=c( 5 , 40 , 40 , 5 , 10 , 5 ), popSize=300, maxiter=300,run = 100,
keepBest=T, names = c("M_Acc", "D_Speed", "Beta", "Com_Acc", "Gap_J", "D_Time"),
fitness = function(b) -Fitness_Function(data, b[1],b[2], b[3],b[4],b[5],b[6]))
My problem is that the code is very long, and it's very slow to run even one iteration. I tried to use dplyr instead of for loops, but I couldn't make it work, because I have to calculate the distance, then the acceleration, then the speed, and then calculate them again for the next row, and I couldn't find a way to do that with dplyr.
I'll post my draft dplyr code here, but it's incomplete because I couldn't finish it.
So help please.
data <- data%>%group_by(Driver,FileName)%>%
mutate(Distance_Term = ifelse(row_number()==1,Speed_C*D_Time - (Speed_C*Delta_V_C)/(2*sqrt(M_Acc*Com_Acc)),0))
data <- data%>%mutate(Distance_Term = ifelse(Distance_Term < 0 , 0, Distance_Term))%>%
mutate(D_Gap = Gap_J + Distance_Term,Acceleration_C = M_Acc*(1-(Speed_C/D_Speed)^Beta-(D_Gap/Distance)^2))
Note: the FileName column is the trip ID. Also, my PC has good specs, so the problem isn't my hardware.
I've replaced the for loop with the accumulate2 function from purrr, so it's much faster and more efficient. I got this answer from this question: Calculate variables using equations then use the generated values to generate new one
Objective_Function <- function(data, M_Acc, D_Speed, Beta, Com_Acc, Gap_J, D_Time){
myfun <- function(list, lcs,lcs2){
ds <- lcs - list[[1]]
Distance <- list[[1]]*D_Time - (list[[1]] * ds) / (2*sqrt(M_Acc*Com_Acc))
if (Distance < 0|is.na(Distance)) {Distance <- 0}
gap <- Gap_J + Distance
acc <- M_Acc * (1 - (list[[1]] / D_Speed)^Beta - (gap / list[[2]])^2)
fcs_new <- list[[1]] + acc * 0.1
ds_new <- lcs2- fcs_new
di_new <- list[[2]]+(ds_new+ds)/2*0.1
return(list(Speed = fcs_new,Distance = di_new))
}
Generated_Data <- data %>%group_by(Driver,FileName)%>%
mutate(Speed_Distance_Calibrated = accumulate2( .init = list(Filling_Speed[1],
Filling_Range[1]),.x = Lead_Veh_Speed_F,.y = Lead_Veh_Speed_F2, myfun)[-1])%>%ungroup()
Generated_Data <- Generated_Data %>% group_by(Driver,FileName)%>%
mutate(Speed_Distance_Calibrated = append(list(list(Speed = Filling_Speed[1],Distance = Filling_Range[1])),Speed_Distance_Calibrated[-length(Speed_Distance_Calibrated)]))%>%ungroup()
Dif <- map_df(Generated_Data$Speed_Distance_Calibrated, `[`, 2)
Generated_Data <- Generated_Data %>% mutate(Dif_sq = (Dif$Distance - Generated_Data$Filling_Range)^2)
RMSPE <- sqrt(sum(Generated_Data$Dif_sq)/sum(Generated_Data$Filling_Range^2))
return(RMSPE)
}
GA_Test <- ga(type='real-valued', lower=c( 0.1 , 1 , 1 , 0.1 , 0.1 , 0.1 ),
upper=c( 5 , 40 , 40 , 5 , 10 , 5 ), popSize=300, maxiter=300,run = 100,
keepBest=T, names = c("M_Acc", "D_Speed", "Beta", "Com_Acc", "Gap_J", "D_Time"),
fitness = function(b) -Objective_Function(data, b[1],b[2], b[3],b[4],b[5],b[6]),parallel = TRUE)
Summary <- summary(GA_Test)
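A plain loop can also be workable if it avoids growing a data frame with rbind on every row and avoids attach/detach, both of which are commonly slow in R. The following is a minimal sketch of the per-trip recurrence from the question using preallocated numeric vectors. The names simulate_trip and rmspe and the toy inputs are made up for illustration; the equations are copied from the question's loop, and the column meanings (speed, delta velocity, range) are assumed to match Filling_Speed, Filling_DeltaVelocity and Filling_Range.

```r
# Simulate the range for one trip (hypothetical helper, not from the GA package).
# speed, delta_v, obs_range: observed vectors for a single trip.
# pars: named list with M_Acc, D_Speed, Beta, Com_Acc, Gap_J, D_Time.
simulate_trip <- function(speed, delta_v, obs_range, pars, dt = 0.1) {
  n <- length(speed)
  lead_speed <- speed + delta_v        # observed lead-vehicle speed
  sim_range <- numeric(n - 1)          # preallocated instead of rbind()
  s <- speed[1]; dv <- delta_v[1]; d <- obs_range[1]
  for (k in seq_len(n - 1)) {
    desired <- s * pars$D_Time - (s * dv) / (2 * sqrt(pars$M_Acc * pars$Com_Acc))
    if (is.na(desired) || desired < 0) desired <- 0
    gap <- pars$Gap_J + desired
    acc <- pars$M_Acc * (1 - (s / pars$D_Speed)^pars$Beta - (gap / d)^2)
    s_new <- s + acc * dt
    dv_new <- lead_speed[k + 1] - s_new
    d <- d + (dv_new + dv) / 2 * dt
    s <- s_new; dv <- dv_new
    sim_range[k] <- d                  # simulated range for row k + 1
  }
  sim_range
}

# Error measure as defined in the question.
rmspe <- function(sim, obs) sqrt(sum((sim - obs)^2) / sum(obs^2))

# Toy inputs (made-up numbers) just to show the call.
pars <- list(M_Acc = 1, D_Speed = 30, Beta = 4, Com_Acc = 1, Gap_J = 2, D_Time = 1)
sim <- simulate_trip(c(10, 10, 10), c(0, 0, 0), c(20, 20, 20), pars)
err <- rmspe(sim, c(20, 20))
```

In a fitness function, this would be called once per trip (for example via split on FileName) and the squared errors pooled before taking the ratio, as in the original code.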

Repeated forecasts give the same values

I have a monthly dataset of performance (in terms of %) of different sectors in a company in the form
Date |Sector |Value
2016-01-01 |Sect 1 |-20
2016-02-01 |Sect 1 |10
2016-01-01 |Sect 2 |23
2016-02-01 |Sect 1 |10
The data has 20 sectors and monthly values through June 2018. Now I want to forecast Value for the next month. I used the code below:
combine_ts <- function(data, h=1, frequency= 12, start= c(2016,5),
end=c(2018,6))
{
results <- list()
sectgrowthsub <- data[!duplicated(sectgrowthdf2[,2]),]
sectgrowthts <- ts(sectgrowthsub[,3], frequency = frequency, start = start,
end = end)
for (i in 1:(nrow(sectgrowthsub))) {
results[[i]] <- data.frame(Date =
format(as.Date(time(forecast(auto.arima(sectgrowthts), h)$mean)), "%b-%y"),
SectorName = rep(sectgrowthsub[,2], h),
PointEstimate = forecast(auto.arima(sectgrowthts),
h=h)$mean[i])
}
return(data.table::rbindlist(results))
}
fore <- combine_ts(sectgrowthsub)
The problem in this case is that Value forecast is the same for all the Sectors.
Help is much appreciated
I took the liberty of simplifying the problem a little bit and removed the function to better show the process of modeling groups separately:
library(magrittr)
library(forecast)
dat <- data.frame(value = c(rnorm(36, 5),
rnorm(36, 50)),
group = rep(1:2, each = 36))
# make a list where each element is a group's timeseries
sect_list <- dat %>%
split(dat$group) %>%
lapply(function(x, frequency, start) {
ts(x[["value"]], frequency = 12, start = 1 ) })
# then forecast on each groups timeseries
fc <- lapply(sect_list, function(x) { data.frame(PointEstimate = forecast(x, h=1)$mean ) }) %>%
do.call(rbind, .) # turn into one big data.frame
fc
PointEstimate
1 5.120082
2 49.752510
Let me know if you get hung up on any parts of this.
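In the question's loop, the model appears to be fitted to the same series (sectgrowthts) on every iteration, which would explain why every sector gets the same forecast. A sketch of the split-then-fit idea on data shaped like the question's (Date, Sector, Value) is shown below. The data are simulated, the helper name forecast_sector is made up, and a fixed AR(1) model from base R's arima stands in for auto.arima so that the example runs without the forecast package.

```r
set.seed(1)
# Made-up long-format data: two sectors, 30 monthly values each (Jan 2016 to Jun 2018).
dat <- data.frame(
  Date   = rep(seq(as.Date("2016-01-01"), by = "month", length.out = 30), times = 2),
  Sector = rep(c("Sect 1", "Sect 2"), each = 30),
  Value  = c(rnorm(30, mean = 5), rnorm(30, mean = 50))
)

# Fit one model per sector; AR(1) is an assumption standing in for auto.arima().
forecast_sector <- function(values, h = 1) {
  series <- ts(values, frequency = 12, start = c(2016, 1))
  fit <- arima(series, order = c(1, 0, 0), method = "ML")
  as.numeric(predict(fit, n.ahead = h)$pred)
}

# split() gives one vector of values per sector, so each fit sees only its own data.
by_sector <- split(dat$Value, dat$Sector)
fc <- data.frame(
  Sector = names(by_sector),
  PointEstimate = vapply(by_sector, forecast_sector, numeric(1)),
  row.names = NULL
)
fc
```

The key point is that the series passed to the model changes with each sector; swapping arima for auto.arima and forecast inside forecast_sector should keep the same structure.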

Calculate a weighted sum from a timeseries with irregular spacing based on a function

Given a data frame containing a time series with irregularly spaced intervals, defined as:
df <- data.frame(date = as.Date("2016-01-01") + ((1:100) + sample(1:5, 100, replace = TRUE)),
data = rnorm(100) )
How can I calculate a rolling sum of the data column over the previous 30 days, with weights defined by this decay function?
tau <- 0.05
decay = function(tau, day){
exp(-tau * day)
}
The current day's data then has a weight of 1 and the data from 30 days ago has a weight of decay(0.05, 30) = 0.2231302. Missing days from the input time series should still be accounted for in computing the weights using the decay function.
If possible, I would like to convert the data frame to a zoo or xts object and then use the rollapplyr function or similar, and to do this with dplyr pipes.
Define a function weighted that takes the last 30 points and from those only keeps the points within 30 days of the last one. Then using those it multiplies that by the weights.
In the pipeline we convert df to zoo and then use rollapplyr with weighted. Note that it is important that we use coredata = FALSE so that the time index is passed to weighted. Without that it would not be.
library(dplyr)
library(zoo)
weighted <- function(x, tau) {
tx <- time(x)
cx <- coredata(x)[tx > tail(tx, 1) - 30] # only keep if within 30 days
w <- decay(tau, seq(to = 0, by = -1, length = length(cx)) )
sum(w * cx)
}
df %>%
read.zoo %>%
rollapplyr(30, weighted, tau = tau, partial = TRUE, coredata = FALSE)
If you want to treat missing days as 0 then use this instead:
weighted <- function(x, tau) {
tx <- as.numeric(time(x))
days <- tail(tx, 1) - tx
w <- (days < 30) * decay(tau, days)
sum(w * coredata(x))
}
Note
We have used the following input modified from the question by adding set.seed for reproducibility. Also the code used in the question might by chance give rise to multiple values with the same date and we eliminated such duplicates.
set.seed(123)
df <- data.frame(date = as.Date("2016-01-01") + 1:100 + sample(1:5, 100, replace = TRUE),
data = rnorm(100) )
df <- df[!duplicated(df$date), ]
tau <- 0.05
decay = function(tau, day){
exp(-tau * day)
}
I am not sure about pipes, but this should get you going:
d <- decay(tau, 29:0)
rollapply(df, 30, function(z) {
data <- as.data.frame(z, stringsAsFactors = FALSE)
data$data <- as.numeric(data$data)
sum(data$data * d)
}, by.column = FALSE)
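For comparison, the same weighted sum can be written in base R without zoo by computing each point's age in days relative to the current date, so gaps in the series are reflected in the weights. This is a sketch only: the name weighted_sum and the three-row toy input are made up, it assumes the dates are sorted and unique, and it compares every row with every other row, so it is quadratic in the number of rows.

```r
decay <- function(tau, day) exp(-tau * day)

# For each date, sum the values from the previous `window` days (including the
# current day), weighted by decay(tau, age in days).
weighted_sum <- function(dates, values, tau, window = 30) {
  d <- as.numeric(dates)
  vapply(seq_along(d), function(i) {
    age <- d[i] - d                      # days between row i and every row
    keep <- age >= 0 & age < window      # only current and earlier days in the window
    sum(decay(tau, age[keep]) * values[keep])
  }, numeric(1))
}

# Toy input (made-up): the third date is 40 days after the first.
dates <- as.Date("2016-01-01") + c(0, 1, 40)
values <- c(1, 2, 3)
out <- weighted_sum(dates, values, tau = 0.05)
out
```

Here the second result combines today's value with yesterday's value at weight exp(-0.05), and the third uses only its own value because the earlier points fall outside the 30-day window.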

Resources