Handling NA and NAN in R - r

I am attempting to run a simple simulation of 100,000 instances for the code below. When attempting to get the sd and mean of of dlogP I am receiving sd(dlogP):NA and mean(dlogP): NaN. I believe i should be getting a sd deviation similar to that of the original rnorm of 5% Can someone help me out as what I am doing incorrectly? I have attempted to adjust the number of iterations which seems to work, but I need to generate 100,000 instances. Thanks in advance.
set.seed(2013)
P_1 <- 100 # Initial price of stock
r <- rnorm(100000, .01, .05) # Generating 100,000 instances
P <- P_1*cumprod(1+r)
set.seed(2013)
logP<- log(P)
dlogP <-log1p(P)-logP # The change in logs from t+1 and t
dlogP
head(dlogP,1) # Will output the first value of the matrix
sd(dlogP)
mean(dlogP)
plot(P)

If you insist on 1% daily return, you could do it in log scale, without touching Inf.
set.seed(2013)
P_1 <- 100 # Initial price of stock
r <- rnorm(100000, .01, .05) # Generating 100,000 instances
logP <- log(P_1) + cumsum(log(1+r))
dlogP <-diff(logP) # The change in logs from t+1 and t
#dlogP
head(dlogP,1) # Will output the first value of the matrix
sd(dlogP)
mean(dlogP)

Related

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (i.e. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but only get very small values, which I cannot transform so the make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5. with a mean of 0. I would like to have much more distributed data, so I could simply transform it by adding 50 and have a quite simililar range compared to my first variable.
Does anyone of you know a way to generate this kind of - more or less -meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range 0 100,000. You might need to filter for values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (ie, shape) of the generated values, multiply the values (that fall between (-.5, +.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or any thing else where 100,000 & 50,000 are exchanged for different numbers, will be an example of a 'linear transformation' recommended by #gregor-de-cillia.

Confidence Interval of Sample Means using R

My dataframe contains sampling means of 500 samples of size 100 each. Below is the snapshot. I need to calculate the confidence interval at 90/95/99 for mean.
head(Means_df)
Means
1 14997
2 11655
3 12471
4 12527
5 13810
6 13099
I am using the below code but only getting the confidence interval for one row only. Can anyone help me with the code?
tint <- matrix(NA, nrow = dim(Means_df)[2], ncol = 2)
for (i in 1:dim(Means_df)[2]) {
temp <- t.test(Means_df[, i], conf.level = 0.9)
tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
For any single mean, e. g. 14997, you can not compute a 95%-CI without knowing the variance or the standard deviation of the data, the mean was computed from. If you have access to the standard deviation of each sample, you can than compute the standard error of the mean and with that, easily the 95%-CI. Apparently, you lack the Information needed for the task.
Means_df is a data frame with 500 rows and 1 column. Therefore
dim(Means_df)[2]
will give the value 1.
Which is why you only get one value.
Solve the problem by using dim(Means_df)[1] or even better nrow(Means_df) instead of dim(Means_df)[2].

How to simulate daily stock returns in R

I need to simulate a stock's daily returns. I am given r=(P(t+1)-P(t))/P(t) (normal distribution) mean of µ=1% and sd of σ =5%. P(t) is the stock price at end of day t. Simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I setup t for this example. I am assuming P should be setup as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
You are getting it wrong: from what you wrote, the mean and the sd applies on the return and not on the price. I furthermore make the assumption that the mean is set for an annual basis (1% rate of return from one day to another is just ...huge!) and t moves along a day range of 252 days per year.
With these hypothesis, you can get a series of daily return in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the serie of the prices P (containing 100001 elements, I will take P[1]=100 - change it with your own value if needed):
factor = 1 + r
temp = 100
P = c(100, sapply(1:100000, function(u){
p = factor[u]*temp
temp<<-p
p
}))
Your configuration for the return price you mention (mean=0.01 and sd=0.05) will however lead to exploding stock price (unrealistic model and parameters). Be carefull to check that prod(rate) will not return Inf .
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One of the classical model (which does not mean this model is realistic) assumes the observed log return are following a normal distribution.
Hope this helps.
I see you already have an answer and ColonelBeauvel might have more domain knowledge than I (assuming this is business or finance homework.) I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator which is considered as a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will probably be a lot faster if you ever get into doing large scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1)r_1 then P_3 will be P_1*(1+r_1)*(1+r_2), .i.e a continued product of the vector (1+r) for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
This strategy below iteratively updates the price at time t as 'temp' and multiplies it it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
p = factor[u]*temp
temp<<-p
p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101
To simulate a log return of the daily stock, use the following method:
Consider working with 256 days of daily stock return data.
Load the original data into R
Create another data.frame for simulating Log return.
Code:
logr <- data.frame(Date=gati$Date[1:255], Shareprice=gati$Adj.Close[1:255], LogReturn=log(gati$Adj.Close[1:251]/gati$Adj.Close[2:256]))
gati is the dataset
Date and Adj.close are the variables
notice the [] values.
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
second line translates directly into :
r <- (P[-1] - P[length(P)]) / P[length(P)] # (1:5)[-1] gives 2:5
Stock returns are not normally distributed for Simple Returns ("R"), given their -1 lower bound per compounded period. However, Log Returns ("r") generally are. The below is adapted from #42's post above. There don't seem to be any solutions to simulating from Log Mean ("Expected Return") and Log Stdev ("Risk") in #Rstats, so I've included them here for those looking for "Monte Carlo Simulation using Log Expected Return and Log Standard Deviation"), which are normally distributed, and have no lower bound at -1. Note: from this single example, it would require looping over thousands of times to simulate a portfolio--i.e., stacking 100k plots like the below and averaging a single slice to calculate a portfolio's average expected return at a chosen forward month. The below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){p = factor[u]*temp; temp <<- p; p}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price should not fall below zero. The other examples, will return negative prices with negative expected returns. The code I've shared here can be tested to confirm that negative prices do not exist in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)
prices <- cumsum(chng)
plot(prices, type='l')

Monte Carlo simulation in R

I am trying to simulate data (Y) from an AR(1) model with rho=0.7. Then I will use this data to run a regression of Y on an intercept ( by so doing the parameter estimate becomes the mean of Y), then test the null hypothesis of the coefficient being less than or equal to zero ( alternative is greater than 0) using robust standard errors.
I want to run a Monte Carlo simulation of this hypothesis using 2000 replications for different lag values. the purpose is to show the finite sample performance of the Newey West estimator as the lag changes. so this is how I began
A<-array(0, dim=c(2000,1))
for(i in 1:2000){
y_new<-arima.sim(model=list(ar=0.7), n=50, mean=0,sd=1)
reg<-lm(y_new~1)
ad<-coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag=1, prewhite=FALSE))
A[i]<-ad[,3]
}
My question: is the code above the right way of doing this kind of simulation? And if it is, how can I get a code to repeat this process for different lag values in the HAC test. I want to run the test each time increasing the lag by 1, thus I will be doing this 50 times for lags 1,2,3,4......,50, each time storing the 2000 simulated test statistics in a vector with different names. calculate rejection probabilities for the test statistic (sig. level =0,05, using the critical value of 1.645) for each case and plot them(rejection probabilities) against the various lag values.
Please help
Because you didn't mention the possible purpose of the simulation, it is hard to tell whether it is the right way.
You save a lot of time by computing 50 test statistics for each simulated sample, instead of repeating the simulation 2000 times for each lag (that is, the number of simulation is 2000*50).
Much better format of doing simulation is
library(AER)
library(dplyr)
lags <- 1:50
nreps <- 2000
sim <- function (){
ynew <- arima.sim(model = list(ar=0.7), n=50, mean=0, sd=1)
reg <- lm(ynew ~ 1 )
s <- rep(NA, 50)
for(i in lags){
ad <- coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag = i, prewhite=FALSE))
s[i] <- ad[ ,4]
}
s
}
Following code stores simulation results in a data.frame
result <- lapply(1:nreps, function(i)data.frame(simulation = i, lag = lags, pvalues = sim())) %>%
rbind_all
From your vague description, I extrapolate what you want looks something like
library(ggplot2)
result %>%
group_by(lag) %>%
summarize(rejectfreq = mean(pvalues > 0.05)) %>%
ggplot(., aes(lag, rejectfreq)) + geom_line()+
coord_cartesian(ylim = c(0,1)) +
scale_y_continuous(breaks=seq(0, 1, by=0.1))
Although the figure was created using only 100 simulations, it is evident that the choice of the lags in Newey-West wouldn't matter much when the disturbance terms are i.i.d.

Storing For Loop values after simulation

I'm brand new to R and trying to implement a simple model (which I will extend later) that deals with corporate bond defaults.
For starters, I'm using only two clients.
Parameters:
- two clients (which I name "A" and "B")
- a cash flow of $10,000 will be received from each client if they do not default within 10 years
- pulling together concepts using standard normal random variables, dependent uniform random variables and Gaussian copulas
- run some number of simulations
- store the sum of Client A cash flow plus Client B cash flow and store in a vector named "result"
- finally, take the average of the result vector
My code is:
# define variables
nSim <- 5 # of simulations
rho <- 0.3 # rho
lambda <- 0.01 # default intensity
T <- 10 # time to default
for (i in 1:nSim){
# Step 1: generate 2 independent standard normal random variables
z1 <- rnorm(1, mean=0, sd=1)
z2 <- rnorm(1, mean=0, sd=1)
# Step 2: map the normals into correlated normals
# by Cholesky composition of the correlation matrix
# w1 = z1
# w2 = rho(z1)+sqrt(1-(rho^2))*z2
w1 <- z1
w2 <- rho*z1 - sqrt(1-(rho^2))*z2
# Step 3: using the correlated normals, generate two dependent uniform variables
u <- runif(1, min=0, max=1)
v <- runif(1, min=0, max=1)
# Step 4: using the dependent uniforms, generate two dependent exponentials
tau.A <- (-1/lambda)*log(u)
tau.B <- (-1/lambda)*log(v)
payout.A <- if (tau.A > 10) {10000} else {0}
payout.B <- if (tau.B > 10) {10000} else {0}
result[i] = (payout.A[i] + payout.B[i])
}
# calculate expected value of portfolio
mean(result)
When I run this code, I'm getting an error of "NA" and can't figure out why (again, I'm brand new to R). I don't think each of the simulation values is being stored in the results vector, but don't know how to diagnose the problem.
Thanks in advance to anyone who can help!
--Sarah
Everything works until the results[i] <- (payout.A[i] + payout.B[i]) line. The problem is you never set results.
Before your for loop, add the line:
results <- vector('numeric', length = nSim)
This will create a vector of 0s with a length of nSim. In R is is best to preallocate the space instead of dynamically growing a vector using c().
No the problem is the presence of the [i] assignments in the results[i] <- (payout.A[i] + payout.B[i]) line.
The [i] assignment is okay for the results parameter but not the two payout parameters because each of these are being generated in each loop. So simply remove them to form the line:
results[i] <- (payout.A + payout.B)
will solve your issue. If you wish to keep each payout in its own vector then you need to assign it as such, but it seems that you don't.

Resources