How to simulate daily stock returns in R

I need to simulate a stock's daily returns. I am given that r = (P(t+1) - P(t)) / P(t) is normally distributed with mean µ = 1% and sd σ = 5%, where P(t) is the stock price at the end of day t. I need to simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I set up t for this example? I am assuming P should be set up as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)

You are getting it wrong: from what you wrote, the mean and the sd apply to the return, not to the price. I furthermore assume that the mean is quoted on an annual basis (a 1% rate of return from one day to the next is just... huge!) and that t moves along a range of 252 trading days per year.
With these hypotheses, you can get a series of daily returns in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the series of prices P (containing 100001 elements; I take P[1] = 100, change it to your own value if needed):
factor = 1 + r                      # gross daily returns
temp = 100                          # running price, carried forward via <<-
P = c(100, sapply(1:100000, function(u){
  p = factor[u]*temp
  temp <<- p                        # update the running price for the next day
  p
}))
The parameters you mention for the return (mean = 0.01 and sd = 0.05) will however lead to an exploding stock price (unrealistic model and parameters). Be careful to check that prod(factor) will not return Inf.
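For example, a quick way to run that check once P has been built (a small addition, not part of the original answer):
# overflow check: does the cumulative product ever reach Inf?
any(!is.finite(P))
which(!is.finite(P))[1]   # first offending day, or NA if none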
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One of the classical models (which does not mean it is realistic) assumes the observed log returns follow a normal distribution.
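As a minimal sketch of that log-return model (my addition, not part of the original answer, reusing the annualised 1% mean and a 0.5% daily sd from above):
logret <- rnorm(100000, .01/252, .005)   # daily log returns
P_log <- 100 * exp(cumsum(logret))       # implied price path, always positive
plot(P_log[1:1000], type = "l")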
Hope this helps.

I see you already have an answer, and ColonelBeauvel might have more domain knowledge than I do (assuming this is business or finance homework). I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator, which is considered a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will probably be a lot faster if you ever get into doing large-scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1) and P_3 will be P_1*(1+r_1)*(1+r_2), i.e. a cumulative product of the vector (1+r), for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
The strategy below iteratively updates the price at time t as 'temp' and multiplies it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101

To simulate the daily log returns of a stock, use the following method:
Consider working with 256 days of daily stock return data.
Load the original data into R
Create another data.frame for simulating Log return.
Code:
logr <- data.frame(Date=gati$Date[1:255], Shareprice=gati$Adj.Close[1:255], LogReturn=log(gati$Adj.Close[1:255]/gati$Adj.Close[2:256]))
gati is the dataset.
Date and Adj.Close are the variables.
Notice the [] index values: 256 prices give 255 log returns.
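Since gati itself is not shown, here is a self-contained sketch of the same calculation on a simulated price series (the toy gati_sim data frame below is purely illustrative and stands in for the poster's data):
set.seed(1)
# toy stand-in for 'gati': 256 days of dates and adjusted closing prices
gati_sim <- data.frame(Date = seq(as.Date("2020-01-01"), by = "day", length.out = 256),
                       Adj.Close = 100 * cumprod(1 + rnorm(256, 0, 0.01)))
logr <- data.frame(Date       = gati_sim$Date[1:255],
                   Shareprice = gati_sim$Adj.Close[1:255],
                   LogReturn  = log(gati_sim$Adj.Close[1:255] / gati_sim$Adj.Close[2:256]))
head(logr)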

P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
The second line translates directly into:
r <- (P[-1] - P[-length(P)]) / P[-length(P)] # (1:5)[-1] gives 2:5; (1:5)[-length(1:5)] gives 1:4
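A tiny check of that indexing on a toy price vector (my addition, purely illustrative):
P_toy <- c(100, 102, 99, 99)
(P_toy[-1] - P_toy[-length(P_toy)]) / P_toy[-length(P_toy)]
# the three day-over-day returns: 2%, about -2.94%, and 0%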

Simple returns ("R") are not normally distributed, given their lower bound of -1 per compounded period. Log returns ("r"), however, generally are. The code below is adapted from 42's post above. There don't seem to be many examples of simulating from a log mean ("expected return") and log standard deviation ("risk") in #Rstats, so I've included one here for those looking for a Monte Carlo simulation using log expected return and log standard deviation, which are normally distributed and have no lower bound at -1. Note: to simulate a portfolio from this single example, you would need to loop thousands of times, i.e. stack 100k paths like the one below and average a single slice to calculate the portfolio's average expected return at a chosen forward month. The code below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){
  p = factor[u]*temp
  temp <<- p
  p
}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price will not fall below zero. The other examples will return negative prices when given negative expected returns. The code I've shared here can be tested to confirm that negative prices never occur in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
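For example, rerunning the same simulation with a negative monthly log expected return (a sketch, not in the original answer; -.04 replaces .04) still cannot produce a negative price, because exp() of any real number is positive:
factor_neg <- exp(rnorm(forwardPeriods, -.04, .10))  # negative expected log return
temp <- startPrice
P_neg <- c(startPrice, sapply(1:forwardPeriods, function(u){
  p <- factor_neg[u]*temp
  temp <<- p
  p
}))
min(P_neg) > 0   # TRUE: prices decay toward zero but never cross it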

Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)  # normally distributed daily changes
prices <- cumsum(chng)                              # arithmetic random walk of the price
plot(prices, type='l')

Related

Plotting distribution of variances

My dataset has 2 fields:
Time stamp t: varies between 0 and 60
Variable x: the variance in the value of a variable (say, A) from t-1 to t; varies between -100% and 100%
There are roughly 500 records for each value of the time stamp, e.g.:
500 records where t = 0 and x takes any value between -100% and 100%
490 records where t = 1 and x takes any value between -100% and 100%, and so on.
Note: the value of x is 0 for ~80% of the records.
The aim is to determine at what value of t (it can be one value or a range, e.g. t = 22, or t between 20 and 25) the day-on-day change in A is smallest, which effectively translates to finding the t at which x is most frequently 0 and, when it is not, is at least close to zero.
For this purpose, I aim to plot the variance of x for each day. I can think of using a violin plot with x on the Y axis and t on the X axis, but with 60 values of t it is difficult to fit them all in one chart. Can you suggest an alternative plot for the intended visual analysis?
Does it help if you take the absolute value of the variance (so it's concentrated in 0-100) and try logs, as in https://stats.stackexchange.com/questions/251066/boxplot-for-data-with-a-large-number-of-zero-values?
When you say smallest, you mean closest to 0, right? In that case it's better to work with the absolute variance (on a 0-1 scale), as you can then treat it like zero-inflated binomial data, e.g. with the VGAM package: https://rdrr.io/cran/VGAM/man/zibinomial.html
I've had a play around, and below is an example that I think makes sense. I've only had some experience with zero-inflated models, so it would be good if anyone has feedback :)
library(ggplot2)
library(data.table)
library(VGAM)
# simulate some data
N_t <- 60 # number of t
N_o <- 500 # number of observations at t
t_smallest <- 30 # best value
# simulate some data crudely
set.seed(1)
dataL <- lapply(1:N_t, function(t){
  dist <- abs(t_smallest-t)+10
  values <- round(rbeta(N_o, 10/dist, 300/dist), 2) * sample(c(-1,1), N_o, replace=TRUE)
  data.table(t, values)
})
data <- rbindlist(dataL)
# raw
ggplot(data, aes(factor(t), values)) + geom_boxplot() +
  coord_cartesian(ylim=c(0, 0.1))
# log transformed - may look better with your data
ggplot(data, aes(factor(t), log(abs(values)+1))) +
  geom_violin()
# use absolute values, package needs it as integer p & n, so approximate these
data[, abs.values := abs(values)]
data[, p := round(1000*abs.values, 0)]
data[, n := 1000]
# with a gam, so smooth fit on t. Found it to be unstable though
fit <- vgam(cbind(p, n-p) ~ s(t), zibinomialff, data = data, trace = TRUE)
# glm, with a coefficient for each t, so treats independently
fit2 <- vglm(cbind(p, n-p) ~ factor(t), zibinomialff, data = data, trace = TRUE)
# predict
output <- data.table(t=1:N_t)
output[, prediction := predict(fit, newdata=output, type="response")]
output[, prediction2 := predict(fit2, newdata=output, type="response")]
# plot out with predictions
ggplot(data, aes(factor(t), abs.values)) +
  geom_boxplot(col="darkgrey") +
  geom_line(data=output, aes(x=t, y=prediction2)) +
  geom_line(data=output, aes(x=t, y=prediction), col="darkorange") +
  geom_vline(xintercept = output[prediction==min(prediction), t]) +
  coord_cartesian(ylim=c(0, 0.1))
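To read the estimated 'best' t straight off the predictions (a small follow-up using the output table built above):
# t with the smallest predicted probability of a non-zero change
output[which.min(prediction), t]    # from the smooth (gam) fit
output[which.min(prediction2), t]   # from the per-t (glm) fit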

R: How would I repeatedly simulate how many attempts before a success on a 1/10 chance? (and record how many attempts it took?)

R and probability noob here. I'm looking to create a histogram that shows the distribution of how many attempts it takes to get heads, repeated over 1000+ simulated runs with the equivalent of an unfairly weighted coin (0.1 heads, 0.9 tails).
From my understanding, this is not a geometric distribution or binomial distribution (but might make use of either of these to create the simulated results).
The real-world (ish) scenario I am looking to model this for is a speedrun of the game Zelda: Ocarina of Time. One of the goals in this speedrun is to obtain an item from a character that has a 1 in 10 chance of giving the player the item each attempt. As such, the player stops attempting once they receive the item (which they have a 1/10 chance of receiving each attempt). Every run, runners/viewers will keep track of how many attempts it took to receive the item during that run, as this affects the time it takes the runner to complete the game.
This is an example of what I'm looking to create:
(though with more detailed labels on the x axis if possible). To produce it, I manually flipped a virtual coin with a 1/10 chance of heads over and over. Once I got a successful result, I recorded how many attempts it took into a vector in R and repeated this about 100 times. I then mapped this vector onto a histogram to visualise the distribution of how many attempts it usually takes to get a successful result. Basically, I'd like to automate this simulation instead of manually flipping the virtual unfair coin, writing down how many attempts it took before heads, and entering it into R myself.
I'm not sure if this is quite what you're looking for, but if you create a function for your manual coin flipping, you can just use replicate() to call it many times:
foo <- function(p = 0.1) {
  i <- 0
  failure <- TRUE
  while ( failure ) {
    i <- i + 1
    if ( sample(x = c(TRUE, FALSE), size = 1, prob = c(p, 1-p)) ) {
      failure <- FALSE
    }
  }
  return(i)
}
set.seed(42)
number_of_attempts <- replicate(1000, foo())
hist(number_of_attempts, xlab = "Number of Attempts Until First Success")
As I alluded to in my comment though, I'm not sure why you think the geometric distribution is inappropriate.
It "is used for modeling the number of failures until the first success" (from the Wikipedia on it).
So, we can just sample from it and add one; the approaches are equivalent, but this will be faster when your number of samples is high:
number_of_attempts2 <- rgeom(1000, 0.1) + 1
hist(number_of_attempts2, xlab = "Number of Attempts Until First Success")
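As a quick sanity check (my addition), both simulated vectors should have a mean near the theoretical expected number of attempts, 1/p = 10:
mean(number_of_attempts)    # replicate()/while-loop version
mean(number_of_attempts2)   # rgeom() + 1 version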
I would use the rle() function, since it lets you run a lot of simulations in a short period of time. Use it to count the runs of tails before each head:
> n <- 1e6
> # generate a long string of flips with unfair coin
> flips <- sample(0:1,
+ n,
+ replace = TRUE,
+ .... [TRUNCATED]
> counts <- rle(flips)
> # now pull out the "lengths" of "0" which will be the tails before
> # a head is flipped
> runs <- counts$lengths[counts$value == 0]
> sprintf("# of simulations: %d max run of tails: %d mean: %.1f\n",
+ length(runs),
+ max(runs),
+ mean(runs))
[1] "# of simulations: 90326 max run of tails: 115 mean: 10.0\n"
> ggplot()+
+ geom_histogram(aes(runs),
+ binwidth = 1,
+ fill = 'blue')
and you get a chart like this:
[Histogram of runs]
I would tabulate the cumulative sum of the tosses.
p=.1
N <- 1e8
set.seed(42)
tosses <- sample(0:1, N, T, prob=c(1-p, p))
# cumsum(tosses) labels each toss with the number of heads seen so far;
# tabulate() counts the tosses carrying each label, i.e. one head plus the
# tails that follow it, which equals the number of attempts needed for the next head
attempts <- tabulate(cumsum(tosses))
length(attempts)
# [1] 10003599
hist(attempts, freq=F, col="#F48024")

How to generate a population of random numbers within a certain exponentially increasing range

I have 16068 datapoints with values that range between 150 and 54850 (mean = 3034.22). What would the R code be to generate a set of random numbers that grow in frequency exponentially between 54850 and 150?
I've tried using the rexp() function in R, but can't figure out how to set the range to between 150 and 54850. In my actual data population, the lambda value is 25.
set.seed(123)
myrange <- c(54850, 150)
rexp(16068, 1/25, myrange)
The call produces an error.
Error in rexp(16068, 1/25, myrange) : unused argument (myrange)
The hypothesized population should increase exponentially the closer the data values are to 150. I have 25 data points with a value of 150 and only one with a value of 54850. The simulated population should fall in this range.
This is really more of a question for math.stackexchange, but out of curiosity I provide this solution. Maybe it is sufficient for your needs.
First, ?rexp tells us that it has only two arguments, so we generate a random exponential distribution with the desired length.
set.seed(42) # for sake of reproducibility
n <- 16068
mr <- c(54850, 150) # your 'myrange' with less typing
y0 <- rexp(n, 1/25) # simulate exp. dist.
y <- y0[order(-y0)] # sort
Now we need a mathematical approach to rescale the distribution.
# f(x) = (b-a)(x - min(x))/(max(x)-min(x)) + a
y.scaled <- (mr[1] - mr[2]) * (y - min(y)) / (max(y) - min(y)) + mr[2]
Proof:
> range(y.scaled)
[1] 150.312 54850.312
That's not too bad.
Plot:
plot(y.scaled, type="l")
Note: there might be some mathematical issues with this rescaling approach; see e.g. this answer.

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

I am trying to simulate monthly panels of data where one variable depends on lagged values of that variable in R. My solution is extremely slow. I need around 1000 samples of 2545 individuals, each of whom is observed monthly over many years, but the first sample took my computer 8.5 hours to construct. How can I make this faster?
I start by creating an unbalanced panel of people with different birth dates, monthly ages, and variables xbsmall and error that will be compared to determine the Outcome. All of the code in the first block is just data setup.
# Setup:
library(plyr)
# Would like to have 2545 people (nPerson).
#Instead use 4 for testing.
nPerson = 4
# Minimum and maximum possible ages and birth dates
AgeMin = 10
AgeMax = 50
BornMin = 1950
BornMax = 1963
# Person-specific characteristics
ind =
  data.frame(
    id = 1:nPerson,
    BornYear = floor(runif(length(1:nPerson), min=BornMin, max=BornMax+1)),
    BornMonth = ceiling(runif(length(1:nPerson), min=0, max=12))
  )
# Make an unbalanced panel of people over age 10 up to year 1986
# panel = ddply(ind, ~id, transform, AgeMonths = BornMonth)
panel = ddply(ind, ~id, transform, AgeMonths = (AgeMin*12):((1986-BornYear)*12 + 12-BornMonth))
# Set up some random variables to approximate the data generating process
panel$xbsmall = rnorm(dim(panel)[1], mean=-.3, sd=.45)
# Standard normal error for probit
panel$error = rnorm(dim(panel)[1])
# Placeholders
panel$xb = rep(0, dim(panel)[1])
panel$Outcome = rep(0, dim(panel)[1])
Now that we have data, here is the part that is slow (around a second on my computer for only 4 observations but hours for thousands of observations). Each month, a person gets two draws (xbsmall and error) from two different normal distributions (these were done above), and Outcome == 1 if xbsmall > error. However, if Outcome equals 1 in the previous month, then Outcome in the current month equals 1 if xbsmall + 4.47 > error. I use xb = xbsmall+4.47 in the code below (xb is the "linear predictor" in a probit model). I ignore the first month for each person for simplicity. For your information, this is simulating a probit DGP (but that is not necessary to know to solve the problem of computation speed).
# Outcome == 1 if and only if xb > -error
# The hard part: xb includes information about the previous month's outcome
start_time = Sys.time()
for(i in 1:nPerson){
  # Determine the range of monthly ages to loop over for this person
  AgeMonthMin = min(panel$AgeMonths[panel$id==i], na.rm=T)
  AgeMonthMax = max(panel$AgeMonths[panel$id==i], na.rm=T)
  # Loop over the monthly ages for this person and determine the outcome
  for(t in (AgeMonthMin+1):AgeMonthMax){
    # Indicator for whether Outcome was 1 last month
    panel$Outcome1LastMonth[panel$id==i & panel$AgeMonths==t] = panel$Outcome[panel$id==i & panel$AgeMonths==t-1]
    # xb = xbsmall + 4.47 if Outcome was 1 last month
    # Otherwise, xb = xbsmall
    panel$xb[panel$id==i & panel$AgeMonths==t] = with(panel[panel$id==i & panel$AgeMonths==t,], xbsmall + 4.47*Outcome1LastMonth)
    # Outcome == 1 if xb > -error (i.e. xb + error > 0)
    panel$Outcome[panel$id==i & panel$AgeMonths==t] =
      ifelse(panel$xb[panel$id==i & panel$AgeMonths==t] > - panel$error[panel$id==i & panel$AgeMonths==t], 1, 0)
  }
}
end_time = Sys.time()
end_time - start_time
My thoughts for reducing computer time:
Something with cumsum()
Some wonderful panel data function that I do not know about
Find a way to make the t loop go through the same starting and ending points for each individual and then somehow use plyr::ddply() or dplyr::gather_by()
Iterative solution: make an educated guess about the value of Outcome at each monthly age (say, the mode) and somehow adjust values that do not match the previous month. This would work better in my real application because xbsmall has a very clear trend in age.
Do the simulation only for smaller samples and then estimate the effect of sample size on the values I need (the distributions of regression coefficient estimates not calculated here)
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.

Handling NA and NaN in R

I am attempting to run a simple simulation of 100,000 instances for the code below. When attempting to get the sd and mean of dlogP, I am receiving sd(dlogP): NA and mean(dlogP): NaN. I believe I should be getting a standard deviation similar to that of the original rnorm, 5%. Can someone help me figure out what I am doing incorrectly? I have attempted to adjust the number of iterations, which seems to work, but I need to generate 100,000 instances. Thanks in advance.
set.seed(2013)
P_1 <- 100 # Initial price of stock
r <- rnorm(100000, .01, .05) # Generating 100,000 instances
P <- P_1*cumprod(1+r)
set.seed(2013)
logP<- log(P)
dlogP <-log1p(P)-logP # The change in logs from t+1 and t
dlogP
head(dlogP,1) # Will output the first value of the matrix
sd(dlogP)
mean(dlogP)
plot(P)
If you insist on a 1% daily return, you can do it on the log scale, without ever touching Inf.
set.seed(2013)
P_1 <- 100 # Initial price of stock
r <- rnorm(100000, .01, .05) # Generating 100,000 instances
logP <- log(P_1) + cumsum(log(1+r))
dlogP <- diff(logP) # The change in logs from t to t+1
#dlogP
head(dlogP,1) # Will output the first value of the vector
sd(dlogP)
mean(dlogP)
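As a small follow-up check (not part of the original answer): by construction dlogP is just log(1+r) without its first element, so its mean and sd should closely match those of the log gross returns themselves.
all.equal(dlogP, log(1 + r)[-1])   # TRUE, up to floating-point tolerance
mean(log(1 + r))
sd(log(1 + r))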
