Simulated Maximum Likelihood in R, maxLik

I am trying to estimate a model by simulated maximum likelihood via the maxLik package in R. Unfortunately, with increasing data size, I am running into serious performance problems. Can anyone advise on the following:
Is there a way to speed up my code (it's already vectorized, so I am kind of clueless how to improve it further)?
Is there a way to implement the optimization process via Rcpp in order to speed it up?
Is there any smarter way to implement simulated maximum likelihood with a custom made likelihood function?
I have already tried doParallel on an AWS instance, but that does not significantly speed up the process.
I have created a reproducible example and commented the most important parts:
library(data.table)
#create data:
#Binary DV (y), 10 IDV (V3 - V12), 50 groups (g), with 100 sequential observations each (id)
set.seed(123)
n <- 5000
p <- 10
x <- matrix(rnorm(n * p), n)
g <- rep(seq(1:(n/100)),each=100)
id <- rep(seq(1:(n/max(g))),max(g))
beta <- runif(p)
xb <- c(x %*% beta)
p <- exp(xb) / (1 + exp(xb))
y <- rbinom(n, 1, p)
data <- as.data.table(cbind(id,y,x,g))
#Find starting values for MaxLik via regular glm
standard <-
glm(
y ~
V3 +
V4 +
V5 +
V6 +
V7 +
V8 +
V9 +
V10 +
V11 +
V12,
data = data,
family = binomial(link = "logit")
)
summary(standard)
#set starting values for MaxLik
b <- c(standard$coefficients,sd_V3=0.5,sd_V4=0.5)
#draw 50 x # of groups random values from a normal distribution
draws <- 50
#for each g in the data, 50 randomvalues are drawn
rands <- as.data.table(cbind(g=rep(unique(g),each=draws),matrix(rnorm(length(unique(g))*draws*2,0,1),length(unique(g))*draws,2)))
colnames(rands) <- c("g","SD_V3","SD_V4")
#merge random draws to each group, so every observation is repeated x # of draws
data <- merge(data,rands,by="g",all=T,allow.cartesian=T)
#the likelihood function (for variables V3 and V4, a mean b[2] & b[3] and an SD b[12] & b[13] are estimated)
loglik1 <- function(b){
#I want the standard deviations to vary only across groups (g), but all other parameters to vary across all observations, which is why I am taking the mean across g and id (remember, every observation is a cartesian product with the random draws per group)
ll <- data[,.(gll=mean(((1/(1+exp(-(b[1]+
(b[2]+b[12]*SD_V3)*V3 +
(b[3]+b[13]*SD_V4)*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12))))^y)*
(1-(1/(1+exp(-(b[1]+
(b[2]+b[12]*SD_V3)*V3 +
(b[3]+b[13]*SD_V4)*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12)))))^(1-y))),by=.(g,id)]
return(log(ll[,gll]))
}
co <- maxLik::maxControl(gradtol=1e-04,printLevel=2)
maxlik <- maxLik::maxLik(loglik1,start=b,method="bfgs",control=co)
summary(maxlik)
Thank you very much for your advice

I was able to decrease optimization time dramatically (hours to minutes) by changing the inside of loglik1 <- function(b){ ... } to
return(data[,.(g,id,y,logit=1/(1+exp(-(b[1]+
(b[2]+b[12]*SD_V3)*V3 +
(b[3]+b[13]*SD_V4)*V4 +
(b[4])*V5 +
(b[5])*V6 +
(b[6])*V7 +
(b[7])*V8 +
(b[8])*V9 +
(b[9])*V10 +
(b[10])*V11 +
(b[11])*V12))))][,mean(y*log(logit)+(1-y)*log(1-logit)),by=.(g,id)][,sum(V1)])
However, this only partially solves the problem, since with increasing data size the estimation time increases once again :(
I will probably have to live with this, unless someone has an elegant solution?
EDIT: To pick up on this after a while, in case anyone has the problem in the future... The reason the script takes so long lies in the maxLik package and the computation time needed to derive the Hessian matrix. If you don't need that, you can tell maxLik not to compute it. Since I do need it, I decided to compute it via Rcpp.
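For anyone who wants to see the idea without data.table or maxLik, here is a minimal base-R sketch of the same kind of simulated likelihood, written with matrix operations and optimised with optim() with the Hessian switched off; the Hessian is then computed once at the optimum with optimHess(). Everything in it (the toy data, G, R, U, neg_sim_loglik) is an assumption made for illustration, not the code from the question. Like the question, it averages the likelihood over the draws per observation; a panel specification would multiply within group before averaging.

```r
set.seed(1)
G <- 20; n_g <- 30; R <- 25            # groups, obs per group, draws per group
n <- G * n_g
g <- rep(seq_len(G), each = n_g)
X <- cbind(1, matrix(rnorm(n * 2), n)) # intercept + 2 covariates

# toy data: the coefficient on X[, 2] varies across groups
u_true   <- rnorm(G)[g]
eta_true <- drop(X %*% c(0.2, 1, -0.5)) + 0.8 * u_true * X[, 2]
y <- rbinom(n, 1, 1 / (1 + exp(-eta_true)))

# fixed standard-normal draws: one row per group, expanded to one row per observation
U <- matrix(rnorm(G * R), G, R)[g, , drop = FALSE]   # n x R

neg_sim_loglik <- function(b) {
  eta <- drop(X %*% b[1:3])                          # fixed part, length n
  P   <- 1 / (1 + exp(-(eta + b[4] * X[, 2] * U)))   # n x R success probabilities
  lik <- rowMeans(y * P + (1 - y) * (1 - P))         # average over draws per observation
  -sum(log(lik))
}

start <- c(0, 0, 0, 0.5)
fit <- optim(start, neg_sim_loglik, method = "BFGS", hessian = FALSE)
H   <- optimHess(fit$par, neg_sim_loglik)            # one Hessian, only at the optimum
```

The draws are generated once and reused in every call, so the objective is a smooth function of b, and no merge or grouping happens inside the optimiser loop.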

Related

Plotting distribution of variances

My dataset has 2 fields:
Time stamp t --- Varies between 0 to 60
Variable x – variance in value of a variable (say, A) from t-1 to t. Varies between -100% to 100%
There are roughly 500 records for each value of time stamp- e.g.
500 records where t= 0 and x takes any value between -100% to 100%
490 records where t= 1 and x takes any value between -100% to 100%, and so on.
Note, the value of x is 0 for ~80% of the records
The aim here is to determine at what value of t (Can be one value, or a range, e.g., when t= 22, or is between 20 -25), is the day-on-day change in A the minimum: Which effectively translates to finding out t when x is very frequently= 0, and when not, is at least close to zero.
To this purpose, I aim to plot the variance of x for each day. I can think of using a violin plot with x (Y axis) and t (X-axis), but there being 60 values of t makes it difficult to plot all in one chart. Can you suggest any alternative plot for the intended visual analysis?
Does it help if you take the absolute value of the variance (so it's concentrated in 0-100) and try the log transforms discussed here? https://stats.stackexchange.com/questions/251066/boxplot-for-data-with-a-large-number-of-zero-values.
When you say smallest, you mean closest to 0, right? In this case it's better to work with the absolute variance (on a 0-1 scale), as you can then treat this like zero-inflated binomial data, e.g. with the VGAM package: https://rdrr.io/cran/VGAM/man/zibinomial.html
I've had a play around, and below is an example that I think makes sense. I've only had some experience with zero-inflated models, so it would be good if anyone has some feedback :)
library(ggplot2)
library(data.table)
library(VGAM)
# simulate some data
N_t <- 60 # number of t
N_o <- 500 # number of observations at t
t_smallest <- 30 # best value
# simulate some data crudely
set.seed(1)
dataL <- lapply(1:N_t, function(t){
dist <- abs(t_smallest-t)+10
values <- round(rbeta(N_o, 10/dist, 300/dist), 2) * sample(c(-1,1), N_o, replace=TRUE)
data.table(t, values)
})
data <- rbindlist(dataL)
# raw
ggplot(data, aes(factor(t), values)) + geom_boxplot() +
coord_cartesian(ylim=c(0, 0.1))
# log transformed - may look better with your data
ggplot(data, aes(factor(t), log(abs(values)+1))) +
geom_violin()
# use absolute values, package needs it as integer p & n, so approximate these
data[, abs.values := abs(values)]
data[, p := round(1000*abs.values, 0)]
data[, n := 1000]
# with a gam, so smooth fit on t. Found it to be unstable though
fit <- vgam(cbind(p, n-p) ~ s(t), zibinomialff, data = data, trace = TRUE)
# glm, with a coefficient for each t, so treats independently
fit2 <- vglm(cbind(p, n-p) ~ factor(t), zibinomialff, data = data, trace = TRUE)
# predict
output <- data.table(t=1:N_t)
output[, prediction := predict(fit, newdata=output, type="response")]
output[, prediction2 := predict(fit2, newdata=output, type="response")]
# plot out with predictions
ggplot(data, aes(factor(t), abs.values)) +
geom_boxplot(col="darkgrey") +
geom_line(data=output, aes(x=t, y=prediction2)) +
geom_line(data=output, aes(x=t, y=prediction), col="darkorange") +
geom_vline(xintercept = output[prediction==min(prediction), t]) +
coord_cartesian(ylim=c(0, 0.1))
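If a model-based fit feels like too much, a rough alternative is to reduce each t to two numbers, the share of exact zeros and the mean absolute change, and plot those two lines against t instead of 60 violins. The sketch below uses base R only; the simulated data and the names t_vals, x and summ are assumptions for illustration, not the asker's data.

```r
set.seed(7)
t_vals <- rep(0:60, each = 500)

# toy data: the chance of an exact zero peaks around t = 30
p_zero <- 0.6 + 0.3 * exp(-((t_vals - 30) / 8)^2)
x <- ifelse(runif(length(t_vals)) < p_zero, 0, runif(length(t_vals), -1, 1))

# one row per t: share of zeros and mean absolute change
summ <- data.frame(
  t          = 0:60,
  zero_share = as.vector(tapply(x == 0, t_vals, mean)),
  mean_abs   = as.vector(tapply(abs(x), t_vals, mean))
)

# t with the smallest average absolute change
best_t <- summ$t[which.min(summ$mean_abs)]
```

Plotting summ$zero_share and summ$mean_abs against summ$t gives two simple line charts that show where the change is most often zero and where it is smallest on average.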

R generating binomial Random variables from exponential random variables

I have 100000 exponential random variables generated with rexp and I am asked to generate 100000 binomial random variables from them using built-in R functions.
I really don't know how I can generate one random variable from another. I searched some resources on the internet, but they were mostly about generating Poisson from exponential variables, which are closely related because the exponential distribution can be interpreted as the time between events of a Poisson process. Making Poisson variables can easily be achieved by applying cumsum to the exponentials and using the cut function to make bins containing the number of occurrences in a time interval.
But I don't know how it is possible to generate binomial from exponential.
The function rbin below generates binomial rv's from exponential rv's. The reason why might be a question for CrossValidated, not for StackOverflow, which is about code.
rbin <- function(n, size, p){
onebin <- function(i, size, thres){
I <- 0L
repeat{
S <- sum(rexp(I + 1)/(size + 1 - seq_len(I + 1)))
if(S > thres) break
I <- I + 1L
}
I
}
thres <- -log(1 - p)
sapply(seq_len(n), onebin, size, thres)
}
set.seed(1234)
u <- rbin(100000, 1, 0.5)
v <- rbinom(100000, 1, 0.5)
X <- cbind(u, v)
cbind(Mean = colMeans(X), Var = apply(X, 2, var))
# Mean Var
#u 0.50124 0.2500010
#v 0.49847 0.2500002
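A note on rbin above: it draws a fresh set of exponentials on every pass of the repeat loop. That is harmless for size = 1, as in the check above, but as far as I can tell it distorts the distribution for larger size, because the stopping rule should be applied to one running sum. Below is a hedged sketch (rbin2 is a hypothetical name) that keeps the running sum, plus a direct way to turn exponentials that already exist into Bernoulli variables by thresholding them.

```r
rbin2 <- function(n, size, p) {
  thres <- -log(1 - p)
  one_draw <- function() {
    successes <- 0L
    S <- 0
    while (successes < size) {
      # add one scaled exponential spacing to the running sum
      S <- S + rexp(1) / (size - successes)
      if (S > thres) break
      successes <- successes + 1L
    }
    successes
  }
  replicate(n, one_draw())
}

set.seed(1234)
u <- rbin2(5000, 5, 0.3)      # should behave like rbinom(5000, 5, 0.3)

# reusing exponentials that are already there: P(x <= cutoff) = p
rate <- 2
x <- rexp(100000, rate)
p <- 0.3
b <- as.integer(x <= -log(1 - p) / rate)   # Bernoulli(p), i.e. binomial with size 1
```

Summing size such indicators built from independent exponentials gives a binomial with that size.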

Calculating probabilities of simulated random variables in R

I have the following graph:
I need to travel from A to B. I also assume that I am taking the fastest route from A to B every day.
The travel times (in hours) between the nodes are exponentially distributed. I have simulated them, with the relevant lambda values, in R as follows:
AtoX <- rexp(1000, 4)
AtoY <- rexp(1000, 2.5)
XtoY <- rexp(1000, 10)
YtoX <- rexp(1000, 10)
XtoB <- rexp(1000, 3)
YtoB <- rexp(1000, 5)
I calculated the average daily travel time in R as follows:
AXB <- AtoX + XtoB
AYB <- AtoY + YtoB
AXYB <- AtoX + XtoY + YtoB
AYXB <- AtoY + YtoX + XtoB
TravelTimes <- pmin(AXB, AYB, AXYB, AYXB)
averageTravelTime <- mean(TravelTimes)
I'm now trying to find the following for every single day:
With which probability is each of the four possible routes from A to B taken?
What is the probability that I have to travel more than half an hour?
For (1), I understand that I need to take the cumulative distribution function (CDF) P(x <= X) for each route.
For (2), I understand that I need to take the cumulative distribution function (CDF) P(0.5 => X), where 0.5 denotes half an hour.
I have only just started learning R, and I am unsure of how to go about doing this.
Reading the documentation, it seems that I might need to do something like the following to calculate the CDF:
pexp()
1 - pexp()
How can I do this?
Let R1, R2, R3, R4 be, in some order, random variables corresponding to the total time of the four routes. Then, being sums of independent exponential random variables, each of them follows the Erlang or the Gamma distribution (see here).
To answer 1, you want to find P(min{R1, R2, R3, R4} = R_i) for i=1,2,3,4. While the minimum of independent exponential random variables is tractable (see here), as far as I know that is not the case with Erlang/Gamma distributions in general. Hence, I believe you need to answer this question numerically, using simulations.
The same applies to the second question requiring to find P(min{R1, R2, R3, R4} >= 1/2).
Hence, we have
table(apply(cbind(AXB, AYB, AXYB, AYXB), 1, which.min)) / 1000
# 1 2 3 4
# 0.312 0.348 0.264 0.076
and
mean(TravelTimes >= 0.5)
# [1] 0.145
as our estimates. By increasing 1000 to some higher number (e.g., 1e6 works fast) one could make those estimates more precise.
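To make the precision point concrete, here is a sketch that repeats the simulation with 1e5 days and attaches a binomial standard error, sqrt(p * (1 - p) / N), to each estimated probability. The rates are the ones from the question; the names routes, fastest, route_prob and p_long are introduced here for illustration.

```r
set.seed(99)
N <- 1e5
AtoX <- rexp(N, 4);  AtoY <- rexp(N, 2.5)
XtoY <- rexp(N, 10); YtoX <- rexp(N, 10)
XtoB <- rexp(N, 3);  YtoB <- rexp(N, 5)

routes <- cbind(AXB  = AtoX + XtoB,
                AYB  = AtoY + YtoB,
                AXYB = AtoX + XtoY + YtoB,
                AYXB = AtoY + YtoX + XtoB)

# index of the fastest route for each simulated day
fastest <- max.col(-routes, ties.method = "first")
route_prob <- tabulate(fastest, nbins = 4) / N
names(route_prob) <- colnames(routes)

# travel time actually experienced, and the share of days above half an hour
travel <- routes[cbind(seq_len(N), fastest)]
p_long <- mean(travel > 0.5)

# Monte Carlo standard errors of the estimated probabilities
se_route <- sqrt(route_prob * (1 - route_prob) / N)
se_long  <- sqrt(p_long * (1 - p_long) / N)
```

With N = 1e5 the standard errors cannot exceed about 0.0016, compared with about 0.016 for N = 1000.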

GRG Non-Linear Least Squares (Optimization)

I am trying to convert an Excel spreadsheet that involves the solver function, using GRG Non-Linear to optimize 2 variables that return the lowest sum of squared errors. I have 4 known times (B) at 4 known distances(A). I need to create an optimization function to find what interaction of values for Vmax and Tau produce the lowest sum of squared errors. I have looked at the nls function and nloptr package but can't quite seem to piece them together. Current values for Vmax and Tau are what was determined via the excel solver function, just need to replicate in R. Any and all help would be greatly appreciated. Thank you.
A <- c(0,10, 20, 40)
B <- c(0,1.51, 2.51, 4.32)
Measured <- as.data.frame(cbind(A, B))
Corrected <- Measured
Corrected$B <- Corrected$B + .2
colnames(Corrected) <- c("Distance (yds)", "Time (s)")
Corrected$`X (m)` <- Corrected$`Distance (yds)`*.9144
Vmax = 10.460615006988
Tau = 1.03682513806393
Predicted_X <- c(Vmax * (Corrected$`Time (s)`[1] - Tau + Tau*exp(-Corrected$`Time (s)`[1]/Tau)),
Vmax * (Corrected$`Time (s)`[2] - Tau + Tau*exp(-Corrected$`Time (s)`[2]/Tau)),
Vmax * (Corrected$`Time (s)`[3] - Tau + Tau*exp(-Corrected$`Time (s)`[3]/Tau)),
Vmax * (Corrected$`Time (s)`[4] - Tau + Tau*exp(-Corrected$`Time (s)`[4]/Tau)))
Corrected$`Predicted X (m)` <- Predicted_X
Corrected$`Squared Error` <- (Corrected$`X (m)`-Corrected$`Predicted X (m)`)^2
#Sum_Squared_Error <- sum(Corrected$`Squared Error`)
Is your issue still unsolved?
I'm working on a similar problem and I think I could help.
First you have to define a function that will be the sum of the errors, which has for variables Vmax and Tau.
Then you can call an optimisation algorithm that will change these variables and look for a minimum of your function. optim() might be sufficient for your application, but here is the documentation for nloptr:
https://www.rdocumentation.org/packages/nloptr/versions/1.0.4/topics/nloptr
and here is a list of optimisation packages in R:
https://cran.r-project.org/web/views/Optimization.html
Edit:
I quickly recoded the way I would do it. I'm a beginner, so it's probably not the best way but it still works.
A <- c(0,10, 20, 40)
B <- c(0,1.51, 2.51, 4.32)
Measured <- as.data.frame(cbind(A, B))
Corrected <- Measured
Corrected$B <- Corrected$B + .2
colnames(Corrected) <- c("Distance (yds)", "Time (s)")
Corrected$`X (m)` <- Corrected$`Distance (yds)`*.9144
#initialize values
Vmax0 = 15
Tau0 = 5
x0 = c(Vmax0,Tau0)
#define function to optimise: optim will minimize the output
f <- function(x) {
y=0
#variables that will be optimised to find the minimum value of f
Vmax = x[1]
Tau = x[2]
Predicted_X <- Vmax * (Corrected$`Time (s)` - Tau + Tau*exp(-Corrected$`Time (s)`/Tau))
y = sum((Predicted_X - Corrected$`X (m)`)^2)
return(y)
}
#call optim: results will be available in variable Y
Y<-optim(x0,f)
If you type Y into the console, you will find that the solver finds the same values as Excel, and convergence is achieved.
In R, there is no need to define columns in data frames element by element with brackets as you did; use vectors instead. You should probably follow a tutorial about this first.
Also, it is misleading to set the initial values to values that are already the optimal ones. If you do this, optim() will not optimise any further.
Here is the documentation for optim:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html
and a tutorial on how to use functions:
https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial
Cheers
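Since the question also mentions nls, here is a hedged sketch of the same fit with stats::nls(), which minimises the sum of squared errors directly from a model formula. The data are the corrected times and distances from the question; the column names time_s and dist_m and the starting values are assumptions chosen for illustration.

```r
# corrected times (s) and distances converted from yards to metres
sprint <- data.frame(
  time_s = c(0, 1.51, 2.51, 4.32) + 0.2,
  dist_m = c(0, 10, 20, 40) * 0.9144
)

# least-squares fit of the two-parameter model
fit <- nls(dist_m ~ Vmax * (time_s - Tau + Tau * exp(-time_s / Tau)),
           data  = sprint,
           start = list(Vmax = 10, Tau = 1))

est <- coef(fit)            # fitted Vmax and Tau
sse <- sum(resid(fit)^2)    # sum of squared errors at the optimum
```

Compared with optim(), nls() also gives standard errors through summary(fit), but it can be more sensitive to poor starting values.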

How to simulate daily stock returns in R

I need to simulate a stock's daily returns. I am given r=(P(t+1)-P(t))/P(t) (normal distribution) mean of µ=1% and sd of σ =5%. P(t) is the stock price at end of day t. Simulate 100,000 instances of such daily returns.
Since I am a new R user, how do I setup t for this example. I am assuming P should be setup as:
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
You are getting it wrong: from what you wrote, the mean and the sd apply to the return, not to the price. I furthermore assume that the mean is stated on an annual basis (a 1% rate of return from one day to the next is just... huge!) and that t moves along a range of 252 trading days per year.
With these hypotheses, you can get a series of daily returns in R with:
r = rnorm(100000, .01/252, .005)
Assuming the model you mentioned, you can get the series of prices P (containing 100001 elements; I will take P[1]=100, change it to your own value if needed):
factor = 1 + r
temp = 100
P = c(100, sapply(1:100000, function(u){
p = factor[u]*temp
temp<<-p
p
}))
Your configuration for the returns (mean=0.01 and sd=0.05) will however lead to an exploding stock price (unrealistic model and parameters). Be careful to check that prod(factor) does not return Inf.
Here is the result for the first 1000 values of P, representing 4 years:
plot(1:1000, P[1:1000])
One of the classical models (which does not mean it is realistic) assumes that the observed log returns follow a normal distribution.
Hope this helps.
I see you already have an answer and ColonelBeauvel might have more domain knowledge than I (assuming this is business or finance homework.) I approached it a bit differently and am going to post a commented transcript. His method uses the <<- operator which is considered as a somewhat suspect strategy in R, although I must admit it seems quite elegant in this application. I suspect my method will probably be a lot faster if you ever get into doing large scale simulations.
Starting with your code:
P <- rnorm(100000, .01, .05)
# r=(P(t+1)-P(t))/P(t) definition, not R code
# inference: P_t+1 = r_t*P_t + P_t = P_t*(1+r_t)
# So, all future P's will be determined by P_1 and r_t
Since P_2 will be P_1*(1+r_1), P_3 will be P_1*(1+r_1)*(1+r_2), i.e. a continued product of the vector (1+r), for which there is a vectorized function.
P <- P_1*cumprod(1+r)
#Error: object 'P_1' not found
P_1 <- 100
P <- P_1*cumprod(1+r)
#Error: object 'r' not found
# So the random simulation should have been for `r`, not P
r <- rnorm(100000, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
#Error in plot.window(...) : infinite axis extents [GEPretty(-inf,inf,5)]
str(P)
This occurred because the cumulative product went above the limits of numerical capacity and got assigned to Inf (infinity). Let's be a little more careful:
r <- rnorm(300, .01, .05)
P <- P_1*cumprod(1+r)
plot(P)
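Another way to keep all 100,000 days without hitting Inf is to stay on the log scale: log(P_t) is log(P_1) plus the cumulative sum of log(1 + r), and that sum stays well inside the floating-point range. A small sketch under the same parameters (the names P_naive and log_P are introduced here for illustration):

```r
set.seed(10)
P_1 <- 100
r <- rnorm(100000, 0.01, 0.05)

# naive price path: overflows to Inf somewhere along the way
P_naive <- P_1 * cumprod(1 + r)
first_inf <- which(is.infinite(P_naive))[1]

# same path on the log scale: finite for every day
log_P <- log(P_1) + cumsum(log1p(r))
```

Plotting log_P, or log_P / log(10) for powers of ten, shows the whole path even where the price itself can no longer be represented.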
The strategy below iteratively updates the price at time t as 'temp' and multiplies it by a single value. It's likely to be a lot slower.
r = rnorm(100000, .01/252, .005)
factor = 1 + r
temp = 100
P = c(100, sapply(1:300, function(u){
p = factor[u]*temp
temp<<-p
p
}))
> system.time( {r <- rnorm(10000, .01/250, .05)
+ P <- P_1*cumprod(1+r)
+ })
user system elapsed
0.001 0.000 0.002
> system.time({r = rnorm(10000, .01/252, .05)
+ factor = 1 + r
+ temp = 100
+ P = c(100, sapply(1:300, function(u){
+ p = factor[u]*temp
+ temp<<-p
+ p
+ }))})
user system elapsed
0.079 0.004 0.101
To simulate a log return of the daily stock, use the following method:
Consider working with 256 days of daily stock return data.
Load the original data into R
Create another data.frame for simulating Log return.
Code:
logr <- data.frame(Date=gati$Date[1:255], Shareprice=gati$Adj.Close[1:255], LogReturn=log(gati$Adj.Close[1:255]/gati$Adj.Close[2:256]))
gati is the dataset
Date and Adj.close are the variables
notice the [] values.
P <- rnorm(100000, .01, .05)
r=(P(t+1)-P(t))/P(t)
The second line translates directly into:
r <- (P[-1] - P[-length(P)]) / P[-length(P)] # P[-1] drops the first price, P[-length(P)] the last; (1:5)[-1] gives 2:5
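As a quick sanity check of the indexing, the sketch below builds a short price path from known returns and recovers them again from the prices. The numbers and the names r_true, r_back and r_alt are made up for illustration.

```r
set.seed(3)
r_true <- rnorm(10, 0.01, 0.05)
P <- 100 * cumprod(c(1, 1 + r_true))   # 11 prices, starting at 100

# simple returns: (P[t+1] - P[t]) / P[t]
r_back <- P[-1] / P[-length(P)] - 1
r_alt  <- diff(P) / head(P, -1)        # same thing via diff()
```

Both r_back and r_alt have one element fewer than P, because the first price has no predecessor.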
Stock returns are not normally distributed for Simple Returns ("R"), given their -1 lower bound per compounded period. However, Log Returns ("r") generally are. The below is adapted from #42's post above. There don't seem to be any solutions to simulating from Log Mean ("Expected Return") and Log Stdev ("Risk") in #Rstats, so I've included them here for those looking for "Monte Carlo Simulation using Log Expected Return and Log Standard Deviation"), which are normally distributed, and have no lower bound at -1. Note: from this single example, it would require looping over thousands of times to simulate a portfolio--i.e., stacking 100k plots like the below and averaging a single slice to calculate a portfolio's average expected return at a chosen forward month. The below should give a good basis for doing so.
startPrice = 100
forwardPeriods = 12*10 # 10 years * 12 months with Month-over-Month E[r]
factor = exp(rnorm(forwardPeriods, .04, .10)) # Monthly Expected Ln Return = .04 and Expected Monthly Risk = .1
temp = startPrice
P = c(startPrice, sapply(1:forwardPeriods, function(u){p = factor[u]*temp; temp <<- p; p}))
plot(P, type = "b", xlab = "Forward End of Month Prices", ylab = "Expected Price from Log E[r]", ylim = c(0,max(P)))
n <- length(P)
logRet <- log(P[-1]/P[-n])
# Notice, with many samples this nearly matches our initial log E[r] and stdev(r)
mean(logRet)
# [1] 0.04540838
sqrt(var(logRet))
# [1] 0.1055676
If tested with a negative log expected return, the price should not fall below zero. The other examples will return negative prices with negative expected returns. The code I've shared here can be tested to confirm that negative prices do not occur in the simulation.
min(P)
# [1] 100
max(P)
# [1] 23252.67
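For the many-paths case mentioned above, the loop can be avoided by drawing all log returns into a matrix and taking a cumulative sum down each column. This is a sketch under the same monthly parameters (0.04 and 0.10); n_paths, the month-60 slice and the variable names are arbitrary choices made for illustration.

```r
set.seed(2024)
start_price <- 100
periods <- 12 * 10      # 10 years of monthly steps
n_paths <- 2000
mu <- 0.04; sigma <- 0.10

# one column per simulated path, one row per month
log_ret <- matrix(rnorm(periods * n_paths, mu, sigma), nrow = periods)
paths <- start_price * exp(apply(log_ret, 2, cumsum))

# slice one forward month across all paths
avg_at_60 <- mean(paths[60, ])
q_at_60   <- quantile(paths[60, ], c(0.05, 0.5, 0.95))
```

Because every price is start_price times an exponential, no path can go below zero, whatever the sign of mu.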
Horizontal axis is number of days, and vertical axis is price.
n_prices <- 1000
volatility <- 0.2
amplitude <- 10
chng <- amplitude * rnorm(n_prices, 0, volatility)
prices <- cumsum(chng)
plot(prices, type='l')

Resources