Populate a variable instance once calculations in a previous row are finished - r

I want to calculate the final number of unprocessed claims after each month. First, I calculate the total number of claims to process: it's last month's backlog plus any new_claims coming in during the current month. Then, to calculate the number of closed_claims, I multiply that total by closed_total_ratio and subtract the result from the total. My final variable is open_claims, which should automatically feed into backlog as soon as it's calculated:
library(tidyverse)
set.seed(1)
df <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
             backlog = c(120, rep(NA, 5)),
             new_claims = sample(c(10, 20, 30), 6, replace = T),
             closed_open_ratio = rep(0.2, 6),
             open_claims = rep(NA, 6))
df
set.seed(1)
solution <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
                   backlog = c(120, 104, 99, 95, 100, 88),
                   new_claims = sample(c(10, 20, 30), 6, replace = T),
                   total = c(130, 124, 119, 125, 110, 118),
                   closed_total_ratio = rep(0.2, 6),
                   closed = c(26, 25, 24, 25, 22, 24),
                   open_claims = c(104, 99, 95, 100, 88, 94)
)
solution
The thing is, if I apply something like this:
df %>%
  mutate(total = backlog[1] + cumsum(new_claims),
         closed = closed_open_ratio * total,
         open_claims = total - cumsum(closed)) %>%
  select(backlog, new_claims, total, closed_open_ratio, closed, open_claims)
I fail to move open_claims back to the backlog. What would be a better way of doing it?

Hi Kasia! I think we can't avoid iteration if the result in the next row depends on the result from the previous one. You wrote "I'll be iterating over big data frames", so the best way to save some time is to use Rcpp. You need to create a new "C++ File" (it's integrated with RStudio) with the following code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame forKasia(DataFrame DF) {

  IntegerVector backlog = DF["backlog"];
  IntegerVector new_claims = DF["new_claims"];
  NumericVector closed_open_ratio = DF["closed_open_ratio"];
  IntegerVector open_claims = DF["open_claims"];

  int n = backlog.size();
  IntegerVector total = IntegerVector(n);
  IntegerVector closed = IntegerVector(n);

  for (int i = 0; i < n; i++) {
    total[i] = backlog[i] + new_claims[i];
    closed[i] = round(total[i] * closed_open_ratio[i]);
    open_claims[i] = total[i] - closed[i];
    if (i < n - 1) {
      backlog[i + 1] = open_claims[i];
    }
  }

  return DataFrame::create(
    _["date"] = DF["date"],
    _["backlog"] = backlog,
    _["new_claims"] = new_claims,
    _["total"] = total,
    _["closed_open_ratio"] = closed_open_ratio,
    _["closed"] = closed,
    _["open_claims"] = open_claims
  );
}
Then source it and run:
Rcpp::sourceCpp('forKasia.cpp')
forKasia(df)

Not very elegant, but I think it works. There are some differences from your solution, but the new_claims column is different too:
df <- tibble(date = seq(from = lubridate::as_date('2018-01-01'), to = lubridate::as_date('2018-06-01'), by = 'months'),
             backlog = c(120, rep(NA, 5)),
             new_claims = sample(c(10, 20, 30), 6, replace = T),
             closed_open_ratio = rep(0.2, 6),
             open_claims = rep(NA, 6))
df <- data.frame(df)

for (i in 1:nrow(df)) {
  df$open_claims[i] <- (df$backlog[i] + df$new_claims[i]) - ((df$backlog[i] + df$new_claims[i]) * df$closed_open_ratio[i])
  if (i < nrow(df)) {
    df$backlog[i + 1] <- (df$backlog[i] + df$new_claims[i]) - ((df$backlog[i] + df$new_claims[i]) * df$closed_open_ratio[i])
  }
}
df
date backlog new_claims closed_open_ratio open_claims
1 2018-01-01 120.0000 10 0.2 104.00000
2 2018-02-01 104.0000 20 0.2 99.20000
3 2018-03-01 99.2000 10 0.2 87.36000
4 2018-04-01 87.3600 20 0.2 85.88800
5 2018-05-01 85.8880 30 0.2 92.71040
6 2018-06-01 92.7104 20 0.2 90.16832
Hope it helps.

You can use purrr::accumulate to carry forward the still-open claims, starting from the backlog on Day 1. cumsum and cumprod are the most common examples of this type of computation, but in this case we need something more complex than cumsum because a proportion of the claims gets closed every day.
Let p be the probability of closing (a constant). Let q=1-p be the probability of not closing.
For Day 1 we have backlog + new_claims claims; call the total x1. At the end of Day 1 we have q*x1 claims still open.
For Day 2 we have the previously open claims, q*x1, plus some new ones, x2, and at the end of Day 2 we have q*(q*x1 + x2) claims still open.
One more day makes the pattern clear: for Day 3 we have the previously open claims plus those received that day, and at the end of Day 3 we have q*(q*(q*x1 + x2) + x3) claims still open.
This is the kind of sequential computation we can perform with purrr::accumulate.
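(A quick aside, just a sketch to illustrate the semantics: accumulate folds a function over a vector and keeps every intermediate result, so with `+` it reproduces cumsum; here we simply swap in a function that discounts the running total by q before adding the day's new claims.)
library(purrr)
# accumulate keeps every intermediate result of the fold:
accumulate(c(1, 2, 3, 4), `+`)                     # same as cumsum(c(1, 2, 3, 4))
# our case: scale the running total by q, then add the day's new claims
q <- 0.8
accumulate(c(10, 2, 3), function(x, y) q * x + y)  # x1, q*x1 + x2, q*(q*x1 + x2) + x3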
p_close <- 0.2

df %>%
  # Not very elegant, but we need to add the backlog to the first-day claims
  mutate(new_claims = if_else(row_number() == 1,
                              new_claims + backlog, new_claims)) %>%
  # Computes x1, q*x1 + x2, q*(q*x1 + x2) + x3, ... with q = 1 - p_close
  mutate(tot_claims = accumulate(new_claims, function(x, y) (1-p_close)*x + y)) %>%
  # Simple vectorized product to split the total claims into open and closed
  mutate(open_claims = (1-p_close) * tot_claims,
         closed_claims = p_close * tot_claims) %>%
  # The backlog is the previous day's open claims
  mutate(backlog = if_else(row_number() == 1,
                           backlog, lag(open_claims)))
The above computation assumes that the probability p_close of closing a claim is the same every day. But you can work with purrr::accumulate2 to provide both a vector of claims and a vector of closing probabilities.
This accumulation is a little more complex so let's define it separately.
accumulate_claims <- function(new_claims, closed_open_ratio) {
  f <- function(x, y, p) (1-p)*x + y
  # head(p, -1) drops the last probability. We actually don't need it here
  # as right now we are computing the sequential sums of previously opened
  # claims + the new claims for the day
  x <- accumulate2(new_claims, head(closed_open_ratio, -1), f)
  unlist(x)
}
df %>%
  # Not very elegant, but we need to add the backlog to the first-day claims
  mutate(new_claims = if_else(row_number() == 1, new_claims + backlog, new_claims)) %>%
  # Computes x1, q1*x1 + x2, q2*(q1*x1 + x2) + x3, ... with q_i = 1 - closed_open_ratio[i]
  mutate(tot_claims = accumulate_claims(new_claims, closed_open_ratio)) %>%
  # Simple vectorized product to split the total claims into open and closed
  mutate(open_claims = (1-closed_open_ratio) * tot_claims,
         closed_claims = closed_open_ratio * tot_claims) %>%
  # The backlog is the previous day's open claims
  mutate(backlog = if_else(row_number() == 1, backlog, lag(open_claims)))

One way to tackle the sequential nature of the computation is with recursion: calculate the initial open claims using the first row of the dataset, and then repeat the call on the remaining rows.
calc_open_claims <- function(current_backlog, new_claims, closed_open_ratio) {
  (current_backlog + new_claims) * (1 - closed_open_ratio)
}

open_claims <- function(weekly_changes, accumulator) {
  if (nrow(weekly_changes) == 0) return(accumulator)
  new_backlog <- calc_open_claims(last(accumulator),
                                  weekly_changes$new_claims[1],
                                  weekly_changes$closed_open_ratio[1])
  accumulator <- c(accumulator, new_backlog)
  open_claims(weekly_changes[-1, ], accumulator)
}
open_claims(df, 120)
# Wrapper to kick it off and align the result
open_claims_wrapper <- function(df) {
  starting_backlog <- df$backlog[1]
  oc <- open_claims(df, starting_backlog) # starting_backlog seeds the accumulator
  oc <- oc[-1] # lop off the starting backlog
  mutate(df, open_claims = oc)
}
open_claims_wrapper(df)


While Loops and Midpoints

Recently, I learned how to write a loop that initializes some number, and then randomly generates numbers until the initial number is guessed (while recording the number of guesses it took) such that no number will be guessed twice:
# https://stackoverflow.com/questions/73216517/making-sure-a-number-isnt-guessed-twice
all_games <- vector("list", 100)
for (i in 1:100) {
  guess_i <- 0
  correct_i <- sample(1:100, 1)
  guess_sets <- 1:100 ## initialize a set
  trial_index <- 1
  while (guess_i != correct_i) {
    guess_i <- sample(guess_sets, 1) ## sample from this set
    guess_sets <- setdiff(guess_sets, guess_i) ## remove it from the set
    trial_index <- trial_index + 1
  }
  ## no need to store `i` and `guess_i` (as same as `correct_i`), right?
  game_results_i <- data.frame(i, trial_index, guess_i, correct_i)
  all_games[[i]] <- game_results_i
}
all_games <- do.call("rbind", all_games)
I am now trying to modify the above code to create the following two loops:
(Deterministic) Loop 1 always guesses the midpoint (rounded up) and is told whether the guess is smaller or bigger than the correct number. It then re-takes the midpoint (between the guess and the current floor/ceiling) until it reaches the correct number.
(Semi-deterministic) Loop 2 first makes a random guess and is told whether the guess is bigger or smaller than the number. It then halves the difference and makes its next guess randomly within the smaller range, repeating the process until it reaches the correct number.
I tried to write a sketch of the code:
#Loop 2:
correct = sample(1:100, 1)
guess_1 = sample(1:100, 1)
guess_2 = ifelse(guess_1 > correct, sample(50:guess_1, 1), sample(guess_1:100, 1))
guess_3 = ifelse(guess_2 > correct, sample(50:guess_2, 1), sample(guess_2:100, 1))
guess_4 = ifelse(guess_3 > correct, sample(50:guess_3, 1), sample(guess_3:100, 1))
#etc
But I am not sure if I am doing this correctly.
Can someone please help me with this?
Thank you!
Example : Suppose I pick the number 68
Loop 1: first random guess = 51, (100-51)/2 + 51 = 75, (75-50)/2 + 50 = 63, (75 - 63)/2 + 63 = 69, (69 - 63)/2 + 63 = 66, etc.
Loop 2: first random guess = 53, rand_between(53,100) = 71, rand_between(51,71) = 65, rand(65,71) = 70, etc.
I don't think you need a for loop for this; you can create the structures from the beginning, with sample, sapply and which:
## correct values can repeat, so we set replace to TRUE
corrects <- sample(1:100, 100, replace = TRUE)
## replace is by default FALSE in sample(), if you don't want repeated guesses
## sapply() creates a matrix
guesses <- sapply(1:100, function(x) sample(1:100, 100))
## constructing game_results_i equal to yours, but could be simplified
game_results_i <- data.frame(
  i = 1:100,
  trial_index = sapply(
    1:100,
    function(x) which(
      ## which() returns the position of the matching guess; each column is a
      ## permutation of 1:100, so there is exactly one match
      guesses[, x] == corrects[x]
    )
  ),
  guess_i = corrects,
  correct_i = corrects # guess_i and correct_i are obviously equal
)
Ok, let's see if now I match question and answer properly :)
If I understood your intentions correctly, in both loops you are setting increasingly finer lower and upper bounds: each guess reduces the search space. However, this interpretation does not always match your description, so please double-check whether it is acceptable for your purposes.
I wrote two functions, guess_bisect for the deterministic loop_1 and guess_sample for loop_2:
guess_bisect <- function(correct, n = 100) {
  lb <- 0
  ub <- n + 1
  trial_index <- 1
  guess <- round((ub - lb) / 2) + lb
  while (guess != correct) {
    # cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
    if (guess < correct)
      lb <- guess
    else
      ub <- guess
    guess <- round((ub - lb) / 2) + lb
    trial_index <- trial_index + 1
  }
  trial_index
}

guess_sample <- function(correct, n = 100) {
  lb <- 0
  ub <- n + 1
  trial_index <- 1
  # draw by index: sample(x, 1) would treat a single remaining value x as 1:x
  guess <- ((lb + 1):(ub - 1))[sample.int(ub - lb - 1, 1)]
  while (guess != correct) {
    # cat(lb, ub, guess, "\n") # uncomment to print the guess iteration
    if (guess < correct)
      lb <- guess
    else
      ub <- guess
    guess <- ((lb + 1):(ub - 1))[sample.int(ub - lb - 1, 1)]
    trial_index <- trial_index + 1
  }
  trial_index
}
Obviously, guess_bisect always produces the same result for the same input, while guess_sample varies randomly.
By plotting the results in a simple chart, it seems that the deterministic bisection is much better on average, as the random sampling may happen to pick improvements from the wrong sides. The x-axis is the correct number, spanning 1 to 100, and the y-axis is the trial index; guess_bisect gives the red curve, while many attempts of guess_sample give the blue curves.
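If you want to reproduce that comparison chart, a minimal sketch could look like this (assuming the two functions above; 20 random runs of guess_sample are overlaid in blue, the deterministic bisection in red):
correct_vals <- 1:100
bisect_trials <- sapply(correct_vals, guess_bisect)                 # deterministic, one run
sample_trials <- replicate(20, sapply(correct_vals, guess_sample))  # 20 random runs
matplot(correct_vals, sample_trials, type = "l", lty = 1, col = "blue",
        xlab = "correct number", ylab = "trial index")
lines(correct_vals, bisect_trials, col = "red", lwd = 2)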

Efficient indexing / joining in data.table across multiple dependent conditions for stop detection algorithm

Edit: Real data set available here
With thanks to
Wang, Rui, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor, Xia Zhou, Dror Ben-Zeev, and Andrew T. Campbell. "StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students using Smartphones." In Proceedings of the ACM Conference on Ubiquitous Computing. 2014.
Explanation
I'm running a simulation study in which I am performing stop detection on location data (lat/lon coordinates) based on relatively simple criteria.
A location (A) is a stop if there exists another location (B) with a timestamp of at least 180 seconds after A, and if all locations between A and B inclusive have a distance from A less than or equal to 80 meters.
I've tried to reduce the data such that it still works but doesn't require actual coordinates.
library(data.table)

data <- data.table(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                   latlon = c(0, 50, 80, 90, 90, 100, 190, 110, 110, 110),
                   time = c(0, 60, 120, 180, 240, 300, 360, 420, 480, 520))
id 1 isn't a stop, because the first location with a difference in time > 180 (id 5) has a distance in latlon of 90.
id 2 is a stop, because all locations between itself and the first location with a difference in time > 180 (id 6) have a distance less than 80 (0, 30, 40, 40, 50).
id 6 is not a stop because even though id 10 is > 180 difference in time, id 7 which falls between has a distance greater than 80.
id 8 is not a stop because there is no location afterwards at least 180 seconds following.
Ultimately, I need to be able to assign a "stop id" greedily, such that if I find, for example, that id 2 has points that satisfy the distance requirement through to id 7, locations with ids 2:7 get a stop id of 2.
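To make the rule concrete, a naive quadratic check of the criterion on the toy data looks like this (a reference sketch only, treating latlon as a 1-D coordinate as in this toy example; this brute force is exactly what needs speeding up):
is_stop <- sapply(seq_len(nrow(data)), function(i) {
  later <- which(data$time - data$time[i] >= 180)  # candidate end points B
  any(vapply(later, function(j)
    all(abs(data$latlon[i:j] - data$latlon[i]) <= 80), logical(1)))
})
is_stop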
Matrix and for loop
If I run this:
nrows <- nrow(data)
latlon_dist <- outer(data$latlon, data$latlon, `-`)
latlon_dist[upper.tri(latlon_dist)] <- NA
time_window <- outer(data$time, data$time, `-`)
time_window[upper.tri(time_window)] <- NA

foo <- function(x){
  mindist <- min(which(x[, 1] > 80), nrows)
  if (mindist >= min(which(x[, 2] > 180), nrows + 1)) mindist else NA
}

bar <- array(c(latlon_dist, time_window),
             dim = c(nrows, nrows, 2))

apply(bar, 2, foo)
This gives me back the thresholds (NA 7 7 NA NA NA NA NA NA NA), which I can use in a for loop to set the stop id as appropriate.
threshholds <- apply(bar, 2, foo) - 1

previous_threshhold <- 0
for (i in seq_along(threshholds)) {
  current_threshhold <- threshholds[i]
  if (!is.na(current_threshhold) && current_threshhold > previous_threshhold) {
    data[i:current_threshhold, stop_id := i]
    previous_threshhold <- current_threshhold
  }
}
At this point, this is the only way I've been able to guarantee accuracy. Everything else I've tried I've thought was correct, only to find that it wasn't behaving identically to this situation. But this is, as you might imagine, horribly inefficient, and it's being run 116,000 times in my simulation study.
My assumption is that the best way to handle this is with a non-equi join in data.table.
The other implementation I'm currently running functions better when the number of rows in the data set make the array too memory-heavy. I won't translate this to work on the data, but it's here in case it gives anyone ideas. I've stuck it in a while loop so it can skip over some iterations when it's already assigned a stop_id to a number of points. If points 1:7 all belong to stop_id 1, they aren't considered candidate stops themselves, we just move on to testing again at point 8. It technically returns a different solution, but stops that are "close enough" are merged later in this process, so the final result is unlikely to differ much.
For loop, no matrix
stopFinder <- function(dt){

  nrows <- nrow(dt)

  if (nrows < 20000){
    return(quickStopFinder(dt))
  }

  i <- 1
  remove_indices <- 0

  while (i < nrows) {

    this_ends <- dt[!remove_indices,
                    Position(
                      geodist_vec(rep(longitude[1], .N),
                                  rep(latitude[1], .N),
                                  longitude,
                                  latitude,
                                  paired = TRUE),
                      f = function(x) x > 80,
                      nomatch = .N + 1) ] + i - 1

    # A) Do some number of points occur within the distance?
    # B) If so, is it at least three minutes out?
    if (this_ends > (i + 1) && dt[c(i, (this_ends - 1)), timestamp[.N] > time_window[1]]) {

      # Last index is the one before the distance is broken
      last_index_of_stop <- this_ends - 1

      # Next run, we will remove all prior considerations
      remove_indices <- c(1:last_index_of_stop)

      # Set the point itself
      dt[i,
         `:=`(candidate_stop = TRUE,
              stop_id = id,
              within_stop = TRUE)]

      # Set the attached points
      dt[(i + 1):last_index_of_stop,
         `:=`(within_stop = TRUE,
              stop_id = i)]

      # Start iterating again on the point that broke the distance
      i <- this_ends

    } else {

      # If no stop, move on and leave out this point
      remove_indices <- c(1:i)
      i <- i + 1
    }
  }
  dt[]
}
quickStopFinder is more or less the implementation I share at the beginning, which is memory intensive and slow, but slightly less slow than stopFinder.
Previously, I had something like this as the basis, but it required a lot of subsequent steps and didn't always give me the results I was looking for; I'll add it for posterity.
res <- dt[dt,
on = .(timestamp >= timestamp_dup,
timestamp <= time_window)]
res[, dist := geodist_vec(x1 = longitude,
y1 = latitude,
x2 = i.longitude,
y2 = i.latitude,
paired = TRUE,
measure = "haversine")]
res[, candidate_stop := all(dist <= 80), i.id]
New with real data
Edit with example from real data:
This handles the situation with joins, but grows too big too quickly. It is fast when the data are small.
sm2 <- read.csv(file = "http://daniellemc.cool/sm.csv", row.names = NULL)
sm <- copy(sm2)
setDT(sm)
sm <- sm[, .(timestamp, longitude, latitude, id)]
sm[, timestamp := as.POSIXct(timestamp)]
sm[, id2 := id]
# This is problematic on my data because of how quickly it grows.
test <- sm[sm, on = .(id >= id)]
test[, i.id2 := NULL]
setnames(test, c("time.2", "longitude.2", "latitude.2", "id.1",
"id.2", "time.1", "longitude.1", "latitude.1"))
# Time and distance differences calculated between each pair
test[, distdiff := geodist_vec(longitude.1, latitude.1,
longitude.2, latitude.2,
paired = TRUE)]
test[, timediff := time.2 - time.1]
# Include the next distance to make sure there's at least one within distance and
# over 180 timediff.
test[, nextdistdiff := shift(distdiff, -1), id.1]
# Are all distances within 180 sec within 80, and is the next following also < 80
test[, dist_met := FALSE]
test[timediff < 180, dist_met := all(distdiff < 80 & nextdistdiff < 80), id.1]
test[, dist_met := any(dist_met), id.1]
# Test how many occur consecutively
# This keeps us from having > 80 dist but then coming back within 80
test[, consecutive := FALSE]
test[distdiff < 80, consecutive := c(TRUE, cummin(diff(id.2) == 1) == 1), id.1]
test[consecutive == TRUE & dist_met == TRUE, stop_id := min(id.1), id.2]
test[test[consecutive == TRUE & dist_met == TRUE], stop_id := i.stop_id, on = .(id.1 = id.2)]
test <- unique(test[, .(stop_id, id.1)])
# Join it back to the data.
sm[test, stop_id := stop_id, on = .(id = id.1)]
Using the non-equi join capabilities of data.table, you can join the data to itself while avoiding a cartesian product, which would be too expensive.
As data.table only allows >, < or = in join conditions, the joins are first done on rectangular areas, before filtering out the appropriate distances. On the real data you provided, this results in about 7 times fewer calculations.
library(data.table)
library(geosphere)
data <- copy(sm)
minduration <- 180
maxdistance <- 80
data[, latmin := destPoint(cbind(longitude, latitude), b = 180, d = maxdistance)[, 2]]
data[, latmax := destPoint(cbind(longitude, latitude), b = 0,   d = maxdistance)[, 2]]
data[, lonmin := destPoint(cbind(longitude, latitude), b = 270, d = maxdistance)[, 1]]
data[, lonmax := destPoint(cbind(longitude, latitude), b = 90,  d = maxdistance)[, 1]]
data[, timestampmin := timestamp + minduration]
# Cross product with space and time windows
cross <- data[data,
              .(i.id, x.id, i.latitude, i.longitude, x.latitude, x.longitude,
                dist = distGeo(cbind(x.longitude, x.latitude), cbind(i.longitude, i.latitude)),
                i.timestamp, x.timestamp),
              on = .(timestamp > timestampmin,
                     longitude >= lonmin,
                     longitude <= lonmax,
                     latitude >= latmin,
                     latitude <= latmax)][
  dist < maxdistance]
# Summarizing the results
cross[, .(keep = cumsum(fifelse(diff(x.id - i.id) == 1, 1, NA_integer_))), by = i.id][
  !is.na(keep), .(startid = i.id, nextid = i.id + keep)][
  !(startid %in% nextid)][
  , .(maxid = max(nextid)), by = startid][
  , .(stopid = min(startid)), by = maxid]
maxid stopid
1: 6 1
2: 18 10
3: 26 22
4: 33 28
5: 48 40
---
162: 4273 4269
163: 4276 4274
164: 4295 4294
165: 4303 4301
166: 4306 4305

How to create a summation function with data frame in R?

Just for fun, I am trying to create a basic savings calculator. My current code is:
value <- function(years, apr, initial, investment) {
  df <- as.data.frame(matrix(nrow = years, ncol = 2))
  colnames(df) <- c("year", "value")
  df$year <- c(1:years)
  for (i in 1:years) {
    current_value <- (last_value + investment) * apr
  }
  # repeating the calculation for the data frame
  print(df)
}
What I am trying to do is have the calculator create a table that displays the value each year. I've adapted my code from an old homework assignment, so I am not concerned with how to make the data frame. However, I do not know how to make the formula for the summation.
I am trying to model
Current Value = (Cumulative Value + Investment)*(Annual Percentage Rate)
As an example, let's say initial value is 10, investment is 10, and the APR is 1.05
(10+10)*(1.05)=21
(21+10)*(1.05)=32.55
(32.55+10)*(1.05)=44.68
and so on.
Year is there to number the rows accordingly.
We can use Reduce with accumulate = TRUE
calc_fun <- function(years, apr, initial, investment) {
  value <- Reduce(function(x, y) (x + investment) * y, rep(apr, years), initial,
                  accumulate = TRUE)
  data.frame(year = 0:years, value)
}
calc_fun(3, 1.05, 10, 10)
# year value
#1 0 10.0000
#2 1 21.0000
#3 2 32.5500
#4 3 44.6775
Using a for loop we can do
calc_fun1 <- function(years, apr, initial, investment) {
  value <- numeric(years + 1)
  value[1] <- initial
  for (i in 1:years) value[i + 1] <- (value[i] + investment) * apr
  data.frame(year = 0:years, value)
}
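For example, calling the loop version the same way should reproduce the table above:
calc_fun1(3, 1.05, 10, 10)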

Adding a column to a data frame by calculating each value to be added

Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it in a clearer way.
My problem is that I want to add a column to a data frame of daily returns of a stock. Let's say the returns are normally distributed, and I would like to add a column that contains the value at risk (hist), for which I wrote a function myself.
The restriction is that the function should be applied to each observation together with the last 249 observations before it.
So when the next observation is calculated, it should again take only the last 249 observations of the days before; the input values should move along as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error that I get, instead of my new variable...
Thanks !!
As wibom wrote in the comment, nrow(x) does not work for vectors; what you need is length() instead. Also, you do not need return() in the last line, as R automatically returns the value of the last expression in a function.
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # here you need length as x is a vector, nrow only works for data.frames/matrixes
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated based on the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
r <-
mydata %>%
filter(ridx <= i, ridx > i - 249) %>%
pull(Returns)
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
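As a quick sanity check (just a sketch, assuming both the purrr and base-R blocks above were run in the same session with the same seed), the two versions should agree:
# the_function comes from the tidyverse block above; df here is the base-R data frame
all.equal(unname(df$VaR),
          unname(purrr::map_dbl(df$ridx, the_function, mydata = df)))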

Apply a function to each row of a matrix without using lapply function in R

I have a input data frame with multiple rows. For each row, I want to apply a function. The input data frame has 1,000,000+ rows. How can I speed up the part using lapply ? I would like to avoid the apply family of functions as in Efficient way to apply function to each row of data frame and return list of data frames because these methods seem to be slow in my case.
Here is a reproducible example with a simple function:
library(tictoc) # enable use of tic() and toc() to record time taken for test to compute
func <- function(coord, a, b, c){
  X1 <- as.vector(coord[1])
  Y1 <- as.vector(coord[2])
  X2 <- as.vector(coord[3])
  Y2 <- as.vector(coord[4])

  if(c == 0) {
    res1 <- mean(c((X1 - a) : (X1 - 1), (Y1 + 1) : (Y1 + 40)))
    res2 <- mean(c((X2 - a) : (X2 - 1), (Y2 + 1) : (Y2 + 40)))
    res <- matrix(c(res1, res2), ncol = 2, nrow = 1)
  } else {
    res1 <- mean(c((X1 - a) : (X1 - 1), (Y1 + 1) : (Y1 + 40)))*b
    res2 <- mean(c((X2 - a) : (X2 - 1), (Y2 + 1) : (Y2 + 40)))*b
    res <- matrix(c(res1, res2), ncol = 2, nrow = 1)
  }
  return(res)
}
## Apply the function
set.seed(1)
n = 10000000
tab <- as.matrix(data.frame(x1 = sample(1:100, n, replace = T), y1 = sample(1:100, n, replace = T), x2 = sample(1:100, n, replace = T), y2 = sample(1:100, n, replace = T)))
tic("test 1")
test <- do.call("rbind", lapply(split(tab, 1:nrow(tab)),
function(x) func(coord = x,
a = 40,
b = 5,
c = 1)))
toc()
## test 1: 453.76 sec elapsed
This seems like a good opportunity to refactor and make this a vectorized calculation, which R can solve much faster. (TL;DR: this makes it about 1000x faster.)
It looks like the task here is to take a weighted average of two ranges of integers, where the bookends of the ranges vary by row (based on X1, X2, Y1, and Y2), but the sequences are the same length in each row. This helps, because it means we can use algebra to simplify the calculation.
For the simple case where a = 40, the first sequence runs from x1-40 to x1-1, and the second from y1+1 to y1+40. The mean will be the sum of these two divided by 80. The sum will be 40*x1 + 40*y1 + sum of (-40:-1) + sum of (1:40), and those last two terms cancel out. So you can simply output the average of each pair of columns, multiplied by b.
library(dplyr)
b = 5
quick_test <- tab %>%
  as_data_frame() %>%
  mutate(V1 = (x1 + y1)/2 * b,
         V2 = (x2 + y2)/2 * b)
Using n = 1E6 (10% of OP), the OP function takes 73 seconds. The function above takes 0.08 seconds and has the same output.
For the cases where a != 40, it takes a little more algebra. V1 here ends up as a weighted average, where we're adding up the sequence (x1-a):(x1-1) and the sequence (y1+1):(y1+40), all divided by a+40 (since there are a terms in the x1 sequence and 40 terms in the y1 sequence). We don't actually need to add up these sequences; we can convert them to a shorter calculation using algebra: https://en.wikipedia.org/wiki/Arithmetic_progression
sum of (x1-a):(x1-1) = a*x1 + sum of (-a:-1) = a*x1 + a*(-a - 1)/2 = a*x1 - (a*a + a)/2
That all means we can fully replicate the code for any positive a using:
a = 50
b = 5
tictoc::tic("test 2b")
quick_test2 <- tab %>%
  as_data_frame() %>%
  mutate(V1 = (a*x1 - (a*a + a)/2 + 40*y1 + 820)/(a + 40)*b,
         V2 = (a*x2 - (a*a + a)/2 + 40*y2 + 820)/(a + 40)*b)
tictoc::toc()
This is about 1000x faster. With n = 1E6, a = 41, b = 5, c = 1, the OP solution took 154 seconds on my 2012 laptop, while quick_test2 above took 0.23 sec and had identical results.
(Small addendum, you could add a test to set b = 1 if c == 0, and then you've taken care of the if-else condition.)
Based on Jon Spring's answer, we can do the same with base R:
test2 <- function(d, a, b, c) {
if (c == 0) b <- 1
X <- d[, c('x1', 'x2')]
Y <- d[, c('y1', 'y2')]
(a*X - (a*a + a)/2 + 40*Y + 820)/(a+40)*b
}
res2 <- test2(tab, 40, 5, 1)
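As a quick sanity check of the algebra (a sketch, assuming a = 40, b = 5, c = 1 as above): with a = 40 the correction terms cancel, so the vectorised result should reduce to the plain column average times b.
# should be TRUE (up to floating-point tolerance)
all.equal(unname(res2[, 1]), (tab[, "x1"] + tab[, "y1"]) / 2 * 5)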
It looks like there are already some very fast options here. Another, slower option would be a standard for loop.
It is much slower than theirs, but still about 3 times faster than the lapply.
n = 1e6
tic("test 2")
test <- vector("list", nrow(tab))
for (i in 1:nrow(tab)) {
  test[[i]] <- func(coord = tab[i, ], a = 40, b = 5, c = 1)
}
testout <- do.call(rbind, test)
toc()
> test 2: 3.85 sec elapsed
I suggest looking up the tidyverse, in this case specifically dplyr (a tidyverse sub-package). The tidyverse is a huge collection of useful and "tidy" (aka, FAST) operations. Once you go tidy, you never go back.
First, just some general math advice. Taking an average of a sequence can be done without actually generating the entire sequence. You just need the start and end of the sequence, as the average of the first and last number is the same as the average of the entire sequence. If your real data is a vector of non-sequential numbers let me know. The following three lines of code are a proof that the mean of the first and last number are the same as the mean of the full sequence:
seqstart <- sample(1:50, 1, replace = T)
seqend <- sample(51:100, 1, replace = T)
mean(c(seqstart, seqend)) == mean(seqstart:seqend)
If you don't believe me, paste those 3 lines into your console until you find a FALSE value, or until you believe me. :)
library(tidyverse)
set.seed(1)
n = 10000000
tab <- data.frame(x1 = sample(1:100, n, replace = T),
                  y1 = sample(1:100, n, replace = T),
                  x2 = sample(1:100, n, replace = T),
                  y2 = sample(1:100, n, replace = T))
Notice I am not using a matrix yet. You can recreate your matrix later. If you are starting with a matrix for some reason, honestly I would just change it to a normal table for this so I can use tidy operations more easily. Maybe a guru can teach us how to use tidyverse operations on matrices, I don't know how. Solution:
tic("test 1")
a <- 40
b <- 5
test <- tab %>% mutate(c = 1) %>%
  mutate(res1 = if_else(c == 1,
                        (((x1 - a) + (x1 - 1) + (y1 + 1) + (y1 + 40))/4)*b,
                        (((x1 - a) + (x1 - 1) + (y1 + 1) + (y1 + 40))/4))) %>%
  mutate(res2 = if_else(c == 1,
                        (((x2 - a) + (x2 - 1) + (y2 + 1) + (y2 + 40))/4)*b,
                        (((x2 - a) + (x2 - 1) + (y2 + 1) + (y2 + 40))/4)))
test %>% select(res1, res2) -> test
toc()
test 1: 8.91 sec elapsed
Fast enough for me.
Please note I made a new column with mutate called "c" and set it to 1. This is because dplyr doesn't like it if you use if_else statements that have logical checks against an environmental variable (and if that variable is always 1, why would we code this in the first place?). Thus, I am assuming that you are planning to use a "c" that can sometimes be 1 and sometimes be 0, and I am proposing here that you should have that data in a column that we can reference.
@Jon Spring has provided a really good answer above.
However, I am suggesting a method using {data.table}.
test2 <- data.table(copy(tab))
tic("test2")
a <- 40
b <- 5
c <- 1
test2[, Output1 := (x1*a - 0.5*(a + a^2) + 40 * y1 + 820)/ (a + 40) * b]
test2[, Output2 := (x2*a - 0.5*(a + a^2) + 40 * y2 + 820)/ (a + 40) * b]
toc()
This method takes around 0.4 to 3.28 seconds on my laptop when n = 1e7.
For n = 1e6, the method you posted in the question takes around 138 seconds, while the method I used takes about 0.3 seconds.
