How to estimate the parameters of the BG/NBD model using bbmle?

I am trying to estimate the probability that a customer is still alive after a certain period. I have data for 500 customers of a firm. For each customer I know x (the number of transactions made by that customer over all time periods), tx (the time of the last transaction) and T (the total time between the first purchase and the end of the observation window).
I work with the BG/NBD model. To estimate the probability of being alive, I first need to estimate the model's four parameters (r, alpha, a and b). To optimize these parameters I am using the "bbmle" package (code shown below).
However, when I run the code it doesn't return any result. R also doesn't seem to recognise many of the objects used in the functions below.
Does anyone see an error in my code? Is there another way to write it?
bgLlh <- function(mydata, r, alpha, a, b) {
with (mydata, {
if (a<=0 | b<=0 | r<=0 | alpha<=0) return (NaN)
term1 <-log(gamma(r+mydata$x)) - log(gamma(r)) + r*log(alpha)
term2 <-log(gamma(a+b))+log(gamma(b+mydata$x))-log(gamma(b))-log(gamma(a+b+mydata$x))
term3<- -(r+mydata$x)*log(alpha+mydata$T)
term4 <- if(mydata$x > 1) {log(a)-log(b+mydata$x-1)-(r+mydata$x)*log(alpha+mydata$tx)
} else {0}
llh <- term1 + term2 +log(exp(term3)+(mydata$x>0)*exp(term4))
f <- -sum(llh)
return(f)
})
}
bgEstimateParameters <- function(mydata, initValues, safeMode=FALSE) {
llhd <- function(r, alpha, a, b) {
return (bgLlh(data, r, alpha, a, b))
}
library(bbmle)
if (safeMode) {
fit <- mle2(llhd, initValues, skip.hessian=TRUE, method="Nelder-Mead")
} else {
fit <- mle2(llhd, initValues)
}
return (fit)
}
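Two things in the code above stand out: the wrapper calls bgLlh(data, ...) although its argument is named mydata, and if(mydata$x > 1) tests a whole vector with a scalar if. Below is a minimal base-R sketch of a vectorised version, assuming mydata has columns x, tx and T as described in the question and using x > 0 as the repeat-buyer indicator in both places. The names bgNegLogLik and bgPAlive are made up for illustration; bgPAlive uses the standard BG/NBD expression for the probability of being alive.

```r
# Negative log-likelihood of the BG/NBD model, vectorised over customers.
# Assumption: `mydata` has numeric columns x, tx and T (as in the question).
bgNegLogLik <- function(mydata, r, alpha, a, b) {
  if (a <= 0 || b <= 0 || r <= 0 || alpha <= 0) return(NaN)
  x <- mydata$x
  tx <- mydata$tx
  Tobs <- mydata$T  # renamed locally so it is not confused with TRUE
  # lgamma() avoids the overflow that log(gamma()) hits for large arguments
  term1 <- lgamma(r + x) - lgamma(r) + r * log(alpha)
  term2 <- lgamma(a + b) + lgamma(b + x) - lgamma(b) - lgamma(a + b + x)
  term3 <- -(r + x) * log(alpha + Tobs)
  # term4 only counts for repeat buyers (x > 0); pmax() keeps log() defined
  # for x == 0, where the term is multiplied by zero anyway
  term4 <- log(a) - log(b + pmax(x, 1) - 1) - (r + x) * log(alpha + tx)
  -sum(term1 + term2 + log(exp(term3) + (x > 0) * exp(term4)))
}

# P(alive) per customer for given parameter values.
bgPAlive <- function(mydata, r, alpha, a, b) {
  x <- mydata$x
  tx <- mydata$tx
  Tobs <- mydata$T
  odds <- (a / (b + pmax(x, 1) - 1)) * ((alpha + Tobs) / (alpha + tx))^(r + x)
  1 / (1 + (x > 0) * odds)
}
```

With bbmle, one way to fit it would be mle2(function(r, alpha, a, b) bgNegLogLik(mydata, r, alpha, a, b), start = list(r = 1, alpha = 1, a = 1, b = 1), method = "Nelder-Mead"), where the start values are arbitrary placeholders. The fitted coefficients can then be passed to bgPAlive.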

Related

Double summation in function which will be used for log likelihood estimation

I am trying to get the estimation parameters for the following equation:
First, I need to write the function with double summation. Then, I will use any optimization method to get the estimated parameters (can be nlm, optim etc). Here is my attempt (though, I am not sure how to add the double summation part in the end, two for loops):
There are 5 parameters: alpha_x, sigma_x, lambda_x, theta_x and phi_x, stored in that order in step_3 (so step_3[1] == alpha_x).
s3 <- function(step_3) {
abcd <- 0
for (a in 1:nrow(returns[[4]][,1])){
abc <- 0
for (b in 0:7){
abc <- abc + (step_3[3]^b*(factorial(b))^-1*(sqrt(step_3[2]^2+b*step_3[5]^2))^-1*exp(-1*(as.numeric(returns[[4]][a,1])-(step_3[1]+b*step_3[4]))^2*(2*(step_3[2]^2+b*step_3[5]^2))^-1))
}
abcd <- abcd + log(abc)
}
abcd <- -abcd + 0.5*nrow(returns[[4]][,1])*log(2*pi) + nrow(returns[[4]][,1])*step_3[3]
return(abcd)
}
nlm(s3, step_3 <- c(0.01,0.01,0.01,0.01,0.01), hessian = TRUE )
I am getting the estimation results, but the estimates are way off:
My question is how to implement the double summation efficiently (an outer sum over the T observations of the log of an inner sum over l) as a function that can be passed to an optimizer. Suggestions using sapply etc. would be appreciated.
(l = 0:7, T = number of observations)
return1 = a vector of price returns (all returns lie between -1 and 1).
After estimation, the sample mean and variance should be roughly equal to:
sample_mean = alpha_x + lambda_x * theta_x
sample_variance = sigma_x^2 + lambda_x * (theta_x^2 + phi_x^2)
Thanks.
Here is a sample that can be used for the analysis:
c(0.02423747 ,-0.00419738, -0.03153278, 0.05343888, -0.0175492, 0.00848472, 0.01043673, -0.02123556, 0.01954968, -0.06462473, 0.02679873, 0.07971938, 0.00050474, -0.01768566,
-0.05206762, -0.00413691, 0.06390499, 0.00269576, 0.01520837, 0.00091051, 0.03499043, -0.00121999, -0.00123521, -0.01961684, 0.03360355, 0.01803711, 0.01772631, 0.036523,
-0.00038927, 0.00905013, 0.01150976, 0.00480223, 0.01916402, 0.00054628, 0.01911904, 0.02194556, 0.00371314, 0.03376601, 0.0546574, -0.03972611, -0.0272525, 0.00271509,
0.02137819, 0.00483075, 0.03538795, 0.02981431, 0.00428509, -0.07192935, 0.01770175, -0.09626522, 0.07574215, 0.02929555, 0.01776551, 0.0385604, -0.06804089, 0.0666583,
0.01304272, -0.01825728, 0.01703525, 0.02022584, 0.03348027, 0.02818876, -0.00162942, -0.08785954, -0.13366772, 0.10243928)
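The inner sum in the loop above is a truncated Poisson mixture of normal densities, so the double summation can be written without explicit loops using outer(), dpois() and dnorm(). This is only a sketch under that reading of the loop: the function name negloglik, the argument jmax and the 1e10 penalty for invalid parameters are assumptions for illustration, and x stands for a plain numeric vector of returns.

```r
# Negative log-likelihood: -sum over observations of log(sum over j = 0..jmax).
# par = c(alpha, sigma, lambda, theta, phi), the same order as step_3.
negloglik <- function(par, x, jmax = 7) {
  alpha <- par[1]; sigma <- par[2]; lambda <- par[3]
  theta <- par[4]; phi <- par[5]
  # Assumed guard: keep the optimizer away from invalid regions
  if (sigma <= 0 || lambda <= 0) return(1e10)
  j <- 0:jmax
  # Rows are observations, columns are jump counts j
  dens <- outer(x, j, function(xi, jj) {
    dpois(jj, lambda) * dnorm(xi, mean = alpha + jj * theta,
                              sd = sqrt(sigma^2 + jj * phi^2))
  })
  -sum(log(rowSums(dens)))
}
```

It could then be passed to an optimizer, for example optim(c(0.01, 0.01, 0.01, 0.01, 0.01), negloglik, x = x). Whether the resulting estimates improve will still depend on the starting values.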

Likelihood for a first-order censored autoregressive process with covariate (see Jung Wook Park, Marc G. Genton and Sujit K. Ghosh)

library(mvtnorm)
set.seed(14)
n=10000
sigmatrue<-1
rhotrue<-0.3
b1=0.05
b0=0
y<-arima.sim(model=list(ar=c(0.3)),n=10000 ,sd=sigmatrue) # construct the time series
x=rep(0,n)
for(i in 1:n){
x[i]=i
}
for(t in 1:n)
{
y[t]=y[t]+b0+b1*x[t]
}
est=arima(y,order=c(1,0,0),xreg=x,include.mean=TRUE,method="ML",kappa=1e+06)
cens<-rep(0, n)
c=(9/10)*(n*b1+b0)
for (i in 1:n) {
if(y[i]>c){
y[i]<-c
cens[i]<-1
}
}
ll<-function(p){
sigma=matrix(c(p[2]^2/(1-p[3]^2), p[2]^2*p[3]/(1-p[3]^2),p[2]^2*p[3]/(1-p[3]^2),p[2]^2/(1-p[3]^2)),ncol=2,nrow=2,byrow=TRUE)
likelihood<-rep(0,n)
for(t in 2 :n){
if(cens[t]==0 & cens[t-1]==0){
likelihood[t]<-dnorm(((y[t]-(p[1]+p[4]*t)-p[3]*(y[t-1]-(p[1]+p[4]*(t-1)))/p[2]) )/p[2])
}
else if(cens[t]==0 & cens[t-1]==1){
likelihood[t]<-(1/(1-pnorm((c-(p[1]+p[4]*t)*sqrt(1-p[3]^2)/p[2]))*sqrt(1-p[3]^2)/p[2]*dnorm(((y[t]-(p[1]+p[4]*t)*sqrt(1-p[3]^2))/p[2])*(1-pnorm(((c-(p[1]+p[4]*(t))-p[3]*(y[t]-(p[1]+p[4]*(t-1)))/p[2])))))))
}
else if(cens[t]==1 & cens[t-1]==0){
likelihood[t]<-1-pnorm(((c-(p[1]+p[4]*t)-p[3]*(y[t-1]-(p[1]+p[4]*(t-1)))/p[2])))
}
else
{
likelihood[t]<-(((pmvnorm(lower=c, upper=Inf , mean=c(p[1]+p[4]*(t-1),p[1]+p[4]*t),sigma=sigma))/(1-pnorm((c-(p[1]+p[4]*(t-1))*sqrt(1-p[3]^2)/p[2])))))
}
}
f0=(sqrt(1-p[3])/p[2]*dnorm(((y[1]-p[1]-p[4])*sqrt(1-p[3]^2))/p[2]))
likelihood[1]=f0
# I added these checks
if (any(likelihood==0)){
likelihood[likelihood==0] = 0.000001 # very small number
}
if (any(likelihood==Inf)){
likelihood[likelihood==Inf] = 1 # very large, or 1 if we are talking about a probability
}
if (any(is.nan(likelihood))){
likelihood[is.nan(likelihood)] = 0.000001
}
minusloglike=-sum(log(likelihood))
#l1=list(Minusloglike=minusloglike,Loglikelihood=log(likelihood))
return(minusloglike)
}
fit<-optim(c(0,1,0.3,0.05),ll,method="L-BFGS-B",lower=c(-Inf,0.001,-0.999,-Inf),upper = c(Inf,Inf,0.999,Inf),hessian=TRUE)
fisher.info<-solve(fit$hessian)
fisher.info
prop.sigma<-sqrt(diag(fisher.info))
sigmas<-diag(prop.sigma)
upper<-fit$par+1.96*sigmas
lower<-fit$par-1.96*sigmas
interval<-data.frame(value=fit$par, lower=diag(lower),upper=diag(upper))
interval
I run this code (it is for a censored first-order autoregressive process with a covariate; there are four cases, depending on whether x(t) and x(t-1) are censored or not, and I don't want the likelihood to get close to zero or Inf). I get this error:
Error in if (any(likelihood == Inf)) { :
missing value where TRUE/FALSE needed
Called from: fn(par, ...)
The program works for n=100, but when n is larger than 100 I get this error. I think this error causes bad estimates of the four parameters (b1, rho, sigma, b0). Does anyone know what I can do?
Thank you for your help.
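The error message is consistent with the likelihood vector containing NaN: comparing NaN with Inf gives NA, so any(likelihood == Inf) returns NA when no element is actually Inf, and if() cannot branch on NA. A minimal sketch of a guard that handles missing values first is below. The helper name clamp.likelihood and its default bounds are assumptions for illustration, and it only removes the error; it does not explain why NaN values appear in the first place.

```r
# Replace unusable likelihood values before taking logs.
# is.na() is TRUE for both NA and NaN, so it must run before the
# == comparisons, which would otherwise return NA.
clamp.likelihood <- function(lik, floor = 1e-6, ceiling = 1) {
  lik[is.na(lik)] <- floor
  lik[lik == 0] <- floor
  lik[is.infinite(lik)] <- ceiling
  lik
}

# The comparison that triggers the error in if():
any(c(NaN, 0.5) == Inf)  # NA, not FALSE
```

Inside ll, the three if blocks could then be replaced by likelihood <- clamp.likelihood(likelihood).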

How can I define a constant variable inside a recursive function in R?

I want to create functions that compute the Simple Moving Average (SMA) and the Exponential Moving Average (EMA).
My problem is in the EMA implementation. I would like to set a constant variable inside the recursive function that is equal to one of its arguments.
Here is my code:
#SMA
sma <- function(P,t,n)
{
return(sum(P[(t-n):(t-1)])/n)
}
#EMA
recursive.ema <- function(P,t,n)
{
# Here I want to create a constant variable that keeps in memory the first
# value of t, i.e. the value of t before the first recursion, so I can use
# it as argument of sma function.
# Something similar to this: tmp <- t (t given from outside the function)
b <- 2/(n+1)
if (t == 1)
{
return(b*P[1] + (1-b)*sma(P,tmp,n))
}
return (b*P[t] + (1-b)*recursive.ema(P,t-1,n))
}
Here P is a vector, time series of prices, t is the index of the vector, the time in my model, and n is any positive number, corresponding to n lagged periods at time t.
Would this work?
e <-new.env()
test<-TRUE
assign("test",test,e)
recursive.ema <- function(P,t,n)
{
test <-get("test",envir=e)
if (test) {
assign("t",t,envir=e)
assign("test",FALSE,envir=e)
} else {
t=get("t",envir=e)
}
b <- 2/(n+1)
if (t == 1)
{
return(b*P[1] + (1-b)*sma(P,tmp,n))
}
return (b*P[t] + (1-b)*recursive.ema(P,t-1,n))
}
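As written, the environment version overwrites t with the stored value on every recursive call, so t never counts down to 1, and test stays FALSE after the first top-level call. An alternative that avoids shared state is to carry the original t along as an extra argument whose default is t itself. A minimal sketch, reusing sma from the question; the argument name t0 is made up for illustration:

```r
# Simple moving average over the n values before index t (from the question)
sma <- function(P, t, n) {
  sum(P[(t - n):(t - 1)]) / n
}

# t0 defaults to the t of the outermost call and is passed on unchanged,
# so every level of the recursion sees the same starting index.
recursive.ema <- function(P, t, n, t0 = t) {
  force(t0)  # evaluate the default now, while t is still the caller's value
  b <- 2 / (n + 1)
  if (t == 1) {
    return(b * P[1] + (1 - b) * sma(P, t0, n))
  }
  b * P[t] + (1 - b) * recursive.ema(P, t - 1, n, t0)
}
```

A caller still writes recursive.ema(P, t, n); only the recursive call supplies t0 explicitly.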

Incorporating a stop function in a random walk

In my previous question, "How do I put arena limits on a random walk?", the community helped create a random walk function in a set arena. This function is designed to simulate a fish moving through an area, but now I need it to stop when a certain condition is satisfied.
I thought it would be as simple as putting
{{if(z>P)break}} in just before the loop function. What I want it to understand is "if this condition is satisfied then stop, otherwise keep going until you reach the maximum number of steps".
Instead it caused my random walk to become deterministic (I always get the same path and it never stops before step.max).
Main question: How do I tell the random walk to stop if z>P?
For reference:
step.max<-125
step.prob<-function(n.times=step.max){
draw=sample(0:100,1,replace=T)
CS<-sample(draw,size=1,replace=TRUE)
CS.max<-100
step.num<-15
SP<-((CS/CS.max)*(1-(step.num/step.max))+(step.num/step.max))*100
if(SP>P){stop('Settled at step number',P)}else{SP
}
}
z<-step.prob(1) #renaming the above function to be easier to reference later
P<-80 #preset cutoff point for value z, ranges from 0-100
walkE <- function(n.times=125,
xlim=c(524058,542800),
ylim=c(2799758,2818500),
start=c(525000,2810000),
stepsize=c(4000,4000)) {
plot(c(0,0),type="n",xlim=xlim,ylim=ylim,
xlab="Easting",ylab="Northing")
x <- start[1]
y <- start[2]
steps <- 1/c(1,2,4,8,12,16)
steps.y <- c(steps,-steps,0)
steps.x <- c(steps,-steps[c(1,5,6)],0)
points(x,y,pch=16,col="red",cex=1)
for (i in 1:n.times) {
repeat {
xi <- stepsize[1]*sample(steps.x,1)
yi <- stepsize[2]*sample(steps.y,1)
newx <- x+xi
newy <- y+yi
if (newx>xlim[1] && newx<xlim[2] &&
newy>ylim[1] && newy<ylim[2]) break
}
lines(c(x,newx),c(y,newy),col="blue")
x <- newx
y <- newy
if(z>P){stop(points(newx,newy,col="green",cex=1))}
#this is where I want it to stop if z>P
else
if(z<P){points(newx,newy,pch=1,col="blue",cex=1)}
else
if(step.max){points(newx,newy,pch=16,col="green",cex=1)}
set.seed(101)}
}
walkE(step.max) #run above random walk function walkE looped for the step.max number
Thanks in advance!!!
This is pretty easy and can be accomplished by inserting a stop(...) function in your user defined step.prob function.
step.prob<-function(n.times=step.max, p){
draw=sample(0:100,1,replace=T)
CS<-sample(draw,size=1,replace=TRUE)
CS.max<-100
CS.max
step.num<-15
SP<-((CS/CS.max)*(1-(step.num/step.max))+(step.num/step.max))*100
if(SP > p) {
stop('Your random walk exceeded ', p)
} else {
SP
}
}
If this doesn't do it for you look into the break command.
So, when the random walk value is > p:
step.prob(p=3)
# Error in step.prob(p = 3) : Your random walk exceeded 3
And if you want to set the value returned by the function to p you can just add in SP <- p before the stop command.
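For the break route: in the question's code z is computed once, before walkE runs, so the test z>P gives the same answer at every step, and set.seed(101) inside the loop resets the random number generator on each iteration, which makes the path repeat. A minimal sketch that draws a fresh value at every step and leaves the loop with break is below. Plotting is left out, and the names settle.score and walk.until are made up for illustration.

```r
# Settlement score for one step, using the SP formula from the question
settle.score <- function(step.num = 15, step.max = 125, CS.max = 100) {
  CS <- sample(0:100, 1)
  ((CS / CS.max) * (1 - step.num / step.max) + step.num / step.max) * 100
}

# Random walk inside the arena; stops early once the score exceeds P.
# Returns a two-column matrix with the visited positions.
walk.until <- function(P, n.times = 125,
                       start = c(525000, 2810000),
                       xlim = c(524058, 542800),
                       ylim = c(2799758, 2818500),
                       stepsize = c(4000, 4000)) {
  steps <- 1 / c(1, 2, 4, 8, 12, 16)
  steps.y <- c(steps, -steps, 0)
  steps.x <- c(steps, -steps[c(1, 5, 6)], 0)
  path <- matrix(NA_real_, nrow = n.times + 1, ncol = 2)
  path[1, ] <- start
  taken <- 0
  for (i in seq_len(n.times)) {
    repeat {  # redraw until the candidate step stays inside the arena
      cand <- path[i, ] + stepsize * c(sample(steps.x, 1), sample(steps.y, 1))
      if (cand[1] > xlim[1] && cand[1] < xlim[2] &&
          cand[2] > ylim[1] && cand[2] < ylim[2]) break
    }
    path[i + 1, ] <- cand
    taken <- i
    if (settle.score() > P) break  # new draw each step, so the walk can settle
  }
  path[seq_len(taken + 1), , drop = FALSE]
}
```

Calling set.seed() once before walk.until(P = 80), rather than inside the loop, keeps a run reproducible without freezing the path.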

R program keeps getting numeric(0) answer

I am a beginner in R. Here's the formula I'm trying to code to find the lambda that maximizes the log likelihood of some bigrams. When the bigrams are not found, the P_b (bigram) function fails, but the P_u (unigram) function should provide the unigram result (lambda = 0).
It works for bigrams that are found. When they're not found, though, I only get numeric(0), not the unigram result.
p.mix <- function(w2, w1) {
(1-lambda) * uni.dfrm$prob[uni.dfrm$token==w2] + lambda * p.bi(w2,w1)
}
The p.bi() function looks complicated because of the indexing so I'm reluctant to post it but it does work when the bigrams are found. It just looks up the count of times w' appears after w and divides it by the times w appears, but I have to go through another vector of vocabulary words so it looks ugly.
When w' is never found occurring after w, instead of a zero count, there's no row at all, which is what apparently causes the numeric(0) result. That's what the mixed model is supposed to solve, but I can't get it to work. Any ideas how this can work?
You can add a test for the case where w2 is numeric(0), for example:
p.mix <- function(w2, w1) {
if(length(w2)>0){
res <- (1-lambda) * uni.dfrm$prob[uni.dfrm$token==w2] +
lambda * p.bi(w2,w1)
}else res <- 0
res
}
EDIT
p.mix <- function(w2, w1) {
if(length(w2) && length(uni.dfrm$prob[uni.dfrm$token==w2]) > 0)
(1-lambda) * uni.dfrm$prob[uni.dfrm$token==w2] + lambda * p.bi(w2,w1)
else 0
}
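If the goal is to fall back to the unigram term rather than return 0, another option is to turn an empty lookup into a zero probability before mixing. A minimal sketch on toy data follows. The tables uni.dfrm and bi.dfrm, the value of lambda, the helper zero.if.empty and this simplified p.bi are all assumptions for illustration, since the real p.bi was not posted.

```r
# Toy lookup tables (hypothetical data)
uni.dfrm <- data.frame(token = c("the", "cat", "sat"),
                       prob = c(0.5, 0.3, 0.2),
                       stringsAsFactors = FALSE)
bi.dfrm <- data.frame(w1 = "the", w2 = "cat", prob = 0.6,
                      stringsAsFactors = FALSE)
lambda <- 0.7

# An empty lookup (numeric(0)) becomes probability 0
zero.if.empty <- function(x) if (length(x) == 0) 0 else x[1]

# Simplified stand-in for the bigram probability P(w2 | w1)
p.bi <- function(w2, w1) {
  zero.if.empty(bi.dfrm$prob[bi.dfrm$w1 == w1 & bi.dfrm$w2 == w2])
}

# An unseen bigram contributes 0, so the unigram term is what remains
p.mix <- function(w2, w1) {
  p.uni <- zero.if.empty(uni.dfrm$prob[uni.dfrm$token == w2])
  (1 - lambda) * p.uni + lambda * p.bi(w2, w1)
}
```

Here p.mix("sat", "the") returns (1 - lambda) times the unigram probability of "sat" instead of numeric(0).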

Resources