Extremely high probability of being alive (BTYD R package)

I am working with the BTYD R package and the problem is that the estimated probabilities that a customer is alive at the end of the calibration period are extremely high. Even observations with only one transaction in the calibration period have a probability of around 0.9999. I know that the parameter "s" (estimated by the package) is used in this calculation. My gamma is very low (almost 0). When I changed it manually to a higher value, the probabilities went down. Any idea how to deal with this problem? I attach my code below.
elog <- dc.MergeTransactionsOnSameDate(elog)
end.of.cal.period <- min(elog$date)+as.numeric((max(elog$date)-min(elog$date))/2)
data <- dc.ElogToCbsCbt(elog, per="week",
T.cal=end.of.cal.period,
merge.same.date=TRUE,
statistic = "freq")
cal2.cbs <- as.matrix(data[[1]][[1]])
## parameter estimation
params2 <- pnbd.EstimateParameters(cal2.cbs)
## log likelihood
(LL <- pnbd.cbs.LL(params2, cal2.cbs))
p.matrix <- c(params2, LL)
for (i in 1:20) {
params2 <- pnbd.EstimateParameters(cal2.cbs, params2)
LL <- pnbd.cbs.LL(params2, cal2.cbs)
p.matrix.row <- c(params2, LL)
p.matrix <- rbind(p.matrix, p.matrix.row)
}
(params2 <- p.matrix[dim(p.matrix)[1],1:4])
# set up parameter names for a more descriptive result
param.names <- c("r", "alpha", "s", "beta")
LL <- pnbd.cbs.LL(params2, cal2.cbs)
# PROBABILITY A CUSTOMER IS ALIVE AT END OF CALIBRATION / TRAINING
x <- cal2.cbs["123", "x"] # x is frequency
t.x <- cal2.cbs["123", "t.x"] # t.x is recency, ie time of last transactions
T.cal <- 26 # week of end of cal, i.e. present
pnbd.PAlive(params2, x, t.x, T.cal)

There is no "gamma" parameter being estimated - "s" and "beta" define the gamma distribution of dropout rate heterogeneity. I recommend editing your post to include the parameters, as well as the output of
pnbd.PlotDropoutRateHeterogeneity(params2)
Without seeing your parameter estimates or knowing the context of your data, there are at least two (not mutually exclusive) possibilities.
First, you could have very low (e.g., zero) dropout rate. If so, you can still fit a plain NBD model of transaction rate, and assume a zero dropout rate.
Second, you could be seeing the "increasing frequency paradox". From pages 17-19 of one of Peter Fader/Bruce Hardie's papers:
For low frequency customers, there is an almost linear relationship between recency and [expected transactions]. However, this relationship becomes highly nonlinear for high frequency customers. In other words, for customers who have made a relatively large number of transactions in the past, recency plays a much bigger role in determining [value] than for an infrequent past purchaser.
According to the authors, a customer such as you describe, with few (or even just a single) transaction, receives a high probability of being "alive" with less dependency on recency. This is because, by definition, a low frequency customer can have long gaps between purchases, so we should assign less risk to a lower frequency customer even if they have not transacted for some time. Compare this to a high frequency customer: the longer we go without seeing a transaction, the sooner we can conclude that the customer is "dead", since we know they would ordinarily be making many transactions.
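As a quick way to see this effect with your own fitted parameters, you can compare P(alive) for an infrequent and a frequent buyer at the same recency (a minimal sketch; the x and t.x values below are made up for illustration):
pnbd.PAlive(params2, x = 1,  t.x = 10, T.cal = 26)   # one repeat purchase, last seen week 10
pnbd.PAlive(params2, x = 15, t.x = 10, T.cal = 26)   # fifteen repeat purchases, last seen week 10
Under typical parameter estimates the second call returns the lower probability, because a long silence is much more surprising for the frequent buyer.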

Why do we discard the first 10000 simulation data?

The following code comes from the book Statistics and Data Analysis for Financial Engineering; it shows how to generate simulated data from an ARCH(1) model.
library(TSA)
library(tseries)
n = 10200                                 # total length, including burn-in
set.seed("7484")
e = rnorm(n)                              # i.i.d. standard normal innovations
a = e                                     # ARCH(1) shocks
y = e                                     # observed series: AR(1) with ARCH(1) errors
sig2 = e^2                                # conditional variance
omega = 1
alpha = 0.55
phi = 0.8
mu = 0.1
omega/(1-alpha) ; sqrt(omega/(1-alpha))   # stationary variance and sd of the shocks
for (t in 2:n){
  a[t] = sqrt(sig2[t])*e[t]
  y[t] = mu + phi*(y[t-1]-mu) + a[t]
  sig2[t+1] = omega + alpha * a[t]^2
}
plot(e[10001:n],type="l",xlab="t",ylab=expression(epsilon),main="(a) white noise")
My question is: why do we need to discard the first 10,000 simulated values?
Bottom Line Up Front
Truncation is needed to deal with sampling bias introduced by the simulation model's initialization when the simulation output is a time series.
Details
Not all simulations require truncation of initial data. If a simulation produces independent observations, then no truncation is needed. The problem arises when the simulation output is a time series. Time series differ from independent data because their observations are serially correlated (also known as autocorrelated). For positive correlations, the result is similar to having inertia: observations which are near neighbors tend to be similar to each other. This characteristic interacts with the reality that computer simulations are programs, and all state variables need to be initialized to something. The initialization is usually to a convenient state, such as "empty and idle" for a queueing service model where nobody is in line and the server is available to immediately help the first customer. As a result, that first customer experiences zero wait time with probability 1, which is certainly not the case for the wait time of some customer k where k > 1.
Here's where serial correlation kicks us in the pants. If the first customer always has a zero wait time, that affects some unknown quantity of subsequent customers' experiences. On average they tend to be below the long term average wait time, but gravitate more towards that long term average as k, the customer number, increases. How long this "initialization bias" lingers depends on both how atypical the initialization is relative to the long term behavior, and the magnitude and duration of the serial correlation structure of the time series.
The average of a set of values yields an unbiased estimate of the population mean only if they belong to the same population, i.e., if E[Xi] = μ, a constant, for all i. In the previous paragraph, we argued that this is not the case for time series with serial correlation that are generated starting from a convenient but atypical state. The solution is to remove some (unknown) quantity of observations from the beginning of the data so that the remaining data all have the same expected value. This issue was first identified by Richard Conway in a RAND Corporation memo in 1961, and published in refereed journals in 1963 - [R.W. Conway, "Some tactical problems on digital simulation", Manag. Sci. 10(1963)47–61]. How to determine an optimal truncation amount has been and remains an active area of research in the field of simulation. My personal preference is for a technique called MSER, developed by Prof. Pres White (University of Virginia). It treats the end of the data set as the most reliable in terms of unbiasedness, and works its way towards the front using a fairly simple measure to detect when adding observations closer to the front produces a significant deviation. You can find more details in this 2011 Winter Simulation Conference paper if you're interested. Note that the 10,000 you used may be overkill, or it may be insufficient, depending on the magnitude and duration of serial correlation effects for your particular model.
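For illustration, here is a rough sketch of the MSER idea in R (my own toy implementation, not the authors' code): for each candidate truncation point d, score the retained tail by its sum of squared deviations divided by the squared number of retained observations, and pick the d that minimizes the score (conventionally searching only the first half of the series).
mser_truncate <- function(x) {
  n <- length(x)
  d_max <- floor(n / 2)
  score <- sapply(0:d_max, function(d) {
    tail_x <- x[(d + 1):n]
    sum((tail_x - mean(tail_x))^2) / (n - d)^2
  })
  which.min(score) - 1   # candidate d starts at 0, R indices start at 1
}
mser_truncate(y)         # suggested number of initial observations to drop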
It turns out that serial correlation causes other problems in addition to the issue of initialization bias. It also has a significant effect on the standard error of estimates, as pointed out at the bottom of page 489 of the WSC2011 paper, so people who calculate the i.i.d. estimator s^2/n can be off by orders of magnitude on the estimated width of confidence intervals for their simulation output.
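As a rough illustration of that last point (my own sketch, reusing the series y simulated above), compare the naive i.i.d. standard error with a simple batch-means standard error on the retained portion of the series:
yk <- y[10001:n]                       # retained output after truncation (200 points)
naive_se <- sd(yk) / sqrt(length(yk))  # i.i.d. formula, ignores autocorrelation
bm <- colMeans(matrix(yk, ncol = 20))  # 20 batches of 10 consecutive observations
batch_se <- sd(bm) / sqrt(length(bm))  # batch-means standard error
c(naive = naive_se, batch = batch_se)
With only 200 retained observations and short batches this is only a crude comparison, but the batch-means estimate is typically noticeably larger than the naive one, which is the point being made above.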

Birth/death model over time simulation

I'm trying to plot a graph that will show species number over time using a birth-death model. So far, all the packages and models I've seen will not allow me to input a starting number of species, which is a problem because my extinction rate is higher than my speciation rate.
The data I have for this question are:
speciation rate (b) = 0.0157,
extinction rate (d) = 53.3,
starting number of species = 250,000.
Currently I've tried using rbdtree and simbdtree.
Thank you in advance
Assuming you don't need the phylogenetic tree of the species that are generated, and just want to know the number of species over time, you don't really need a package for this: the code is fairly simple (if you know what equations to implement).
I'm also assuming that you want a stochastic simulation model.
Let's say for simplicity that you only want to know the numbers of species at the times at which speciation or extinction events occur (this is a little bit easier than figuring them out at equally spaced times). Let's say the per capita mutation/speciation/birth rate is b and the per capita death/extinction rate is d (it's fairly standard to assume these rates are constant per capita; you could make other assumptions if you wanted). Then if there are currently s species present, the total rate of events is (b+d)*s, the probability that the next event is a birth is b/(b+d), and the waiting time to the next event is exponentially distributed with rate (b+d)*s.
num_events <- 1000000                 # upper bound on the number of events to simulate
s <- rep(NA, num_events+1)            # species counts
t <- rep(NA, num_events+1)            # event times
t[1] <- 0
s[1] <- 250000
b <- 0.0157                           # per capita speciation (birth) rate
d <- 53.3                             # per capita extinction (death) rate
set.seed(101)
for (i in 1:num_events) {
  if (s[i]==0) break ## stop if extinct
  delta_t <- rexp(1,rate=(b+d)*s[i])  # waiting time to the next event
  t[i+1] <- t[i] + delta_t
  if (runif(1)<(b/(b+d))) s[i+1] <- s[i]+1 else s[i+1] <- s[i]-1  # speciation or extinction
}
plot(t,s,type="s",log="y")
curve(s[1]*exp((b-d)*x), add=TRUE, lwd=3, col=adjustcolor("red", alpha=0.4))  # deterministic expectation
You can see that until you get down to a few hundred species, it's barely worth bothering with a stochastic simulation: the theoretical exponential decay curve matches the population dynamics almost exactly while the numbers are large.
The code could easily be sped up a little, at some small cost to interpretability (though even as written it should take no more than a few seconds); one possible approach is sketched below.
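One possible speed-up (my own sketch, not part of the answer's code): because the birth probability b/(b+d) is the same at every step, the whole sequence of +1/-1 steps can be drawn at once, the population path recovered with cumsum(), and all waiting times drawn with a single vectorized rexp() call.
set.seed(101)
steps  <- sample(c(1, -1), num_events, replace = TRUE, prob = c(b, d))
s_fast <- c(s[1], s[1] + cumsum(steps))
dead   <- which(s_fast <= 0)[1]                    # truncate at first extinction, if any
if (!is.na(dead)) s_fast <- s_fast[1:dead]
m      <- length(s_fast) - 1
dt     <- rexp(m, rate = (b + d) * s_fast[1:m])    # waiting time depends on current count
t_fast <- c(0, cumsum(dt))
plot(t_fast, s_fast, type = "s")                   # log="y" omitted so the final zero can be drawn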

SVM in R (e1071): Give more recent data higher influence (weights for support vector machine?)

I'm working with Support Vector Machines from the e1071 package in R. This is my first project using SVM.
I have a dataset containing the order histories of ~1k customers over 1 year, and I want to predict customer purchases. For every customer I have the information whether a certain item (out of ~50) was bought or not in a certain week (for 52 weeks, i.e. 1 year).
My goal is to predict next month's purchases for every single customer.
I believe that a purchase, say, 1 month ago is more meaningful for my prediction than a purchase 10 months ago.
My question is: how can I give more recent data a higher impact? There is a 'weight' option in the svm function, but I'm not sure how to use it.
Anyone who can give me a hint? Would be much appreciated!
Here is my code:
# Fit model using Support Vector Machines
# install.packages("e1071")
library(e1071)
response <- train[,5]; # purchases
formula <- response ~ .;
tuned.svm <- tune.svm(train, response, probability=TRUE,
gamma=10^(-6:-3), cost=10^(1:2));
gamma.k <- tuned.svm$best.parameter[[1]];
cost.k <- tuned.svm$best.parameter[[2]];
svm.model <- svm(formula, data = train,
type='eps-regression', probability=TRUE,
gamma=gamma.k, cost=cost.k);
svm.pred <- predict(svm.model, test, probability=TRUE);
Side notes: I'm fitting a model for every single customer. Also, since I'm interested in the probability that customer i buys item j in week k, I set
probability=TRUE
The weights option in the R SVM model is geared more towards assigning weights to handle the problem of imbalanced classes: the class.weights parameter is used to assign weights to the different classes (1/0) in a biased dataset.
To answer your question: to give recent data more weight in an SVM model, a simple trick in the absence of built-in observation-level weighting is to repeat the recent observations (i.e., create duplicate rows for recent data), hence indirectly assigning them higher weight; a sketch is below.
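A hypothetical sketch of that trick (the "week" column name is my assumption about the data, not something shown in the question): replicate each training row in proportion to how recent its week is.
copies <- ceiling(4 * train$week / max(train$week))   # 1 to 4 copies, most for the newest weeks
train_weighted <- train[rep(seq_len(nrow(train)), times = copies), ]
# then fit the svm() call from the question on train_weighted instead of train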
Try this package: https://CRAN.R-project.org/package=WeightSVM
It uses a modified version of 'libsvm' and is able to deal with instance weighting. You can assign higher weights to recent data.
For example, suppose you have simulated data (x, y):
x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
This is an unweighted SVM:
model1 <- wsvm(x, y, weight = rep(1,99))
The blue dots are the unweighted SVM, and they do not fit the first instances well. We want to put more weight on the first several instances.
So we can use a weighted SVM:
model2 <- wsvm(x, y, weight = seq(99,1,length.out = 99))
The green dots are the weighted SVM, and they fit the first instances better.
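Applied back to the original question (a sketch only; I'm assuming a "week" column and the same column layout as the code in the question), weights that grow with the week index would up-weight recent purchases:
library(WeightSVM)
w <- train$week / max(train$week)                   # weight grows linearly with recency
model <- wsvm(x = train[, -5], y = train[, 5], weight = w)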

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (a parasitic worm), using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given time step (tau) and updates the population accordingly; the process repeats up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when a rate is negative, including when population numbers become negative.
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Inital Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
sim.out$run <- i
sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values at each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, events can occur such that the number of events reducing a compartment is greater than the number currently in that compartment. One option is to make the time step very small, but this makes the simulation take very long.
My question is: is there a way to augment the code so that, as the data are created/calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but I only seem to be able to come up with methods that alter the values once the simulation is complete, while the negative values still cause issues within the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method can eventually produce negative numbers. You can try the "OTL" method, based on "Efficient step size selection for the tau-leaping simulation method", in which there are a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
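For instance, dropped into the run loop from the question (a sketch only; everything else is unchanged), the fix amounts to swapping the method argument:
for(i in 1:runs){
  sim <- ssa(sir.x0, sir.a, sir.nu, sir.parms, method="D", tf=140, simName="SIR")
  sim.out <- data.frame(time=sim$data[,1], W=sim$data[,2], M=sim$data[,3], L=sim$data[,4])
  sim.out$run <- i
  sir.out <- rbind(sir.out, sim.out)
}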

SVM in R, value of cost doesn't affect test error rate

I am currently using SVMs in R (e1071) with linear kernels to attempt to classify a high dimensional data set. It consists of around 300 patients with around 12000 gene activity levels measured for each patient. My goal is to predict patient response (binary: treatment effective or not) to a certain drug based upon these gene activities.
I want to establish the range of cost values to pass to the tune.svm function, and this is where I am running into trouble. My understanding is that the way to do this is to try progressively smaller and larger values until lower and upper bounds for reasonable performance are established; nevertheless, when I attempt to do this, no matter how large or small I make my possible costs, my resulting test error rate is never worse than about 50%. This is happening both with my actual data set and with this toy version. If this subset is too small I can provide a more significant chunk of it. Thanks for any advice.
My code:
dat.ex <- read.table("svm_ex.txt", header=T, row.names=1)
trainingSize <- 20
possibleCosts <- c(10^-50, 10^-25, 10^25, 10^50)
trainingDat <- sample(1:dim(dat.ex)[1], replace = FALSE, size = trainingSize)
ex.results <- vector()
for(i in 1:length(possibleCosts))
{
svm.ex <- svm(dat.ex[trainingDat, -1], factor(dat.ex[trainingDat, 1]), kernel="linear", cost=possibleCosts[i], type="C-classification")
test.ex <- predict(svm.ex, newdata=data.frame(x = dat.ex[-trainingDat,-1]))
truth.ex <- table(pred = test.ex, truth = factor(dat.ex[-trainingDat,1]))
exTestCorrectRate <- (truth.ex[1,1] + truth.ex[2,2])/(dim(dat.ex)[1] - trainingSize)
ex.results[i] <- exTestCorrectRate
}
print(ex.results)
First, you are trying extreme values of C. You should check a much smaller range of values (say between 1e-15 and 1e10) at much greater resolution (for example, 25 different values across that interval).
Second, you have a very small dataset; 20 training vectors with 10 dimensions may be hard to model.
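For example, a grid along these lines (a sketch only, reusing the objects from your code; the exact bounds and resolution are up to you) would cover the suggested interval:
cost_grid <- 10^seq(-15, 10, length.out = 25)       # 25 values across the suggested range
tuned <- tune.svm(x = dat.ex[trainingDat, -1],
                  y = factor(dat.ex[trainingDat, 1]),
                  kernel = "linear", cost = cost_grid)
tuned$best.parameters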
I discovered the problem. In the full data set, approximately 2/3 of the responses are 1 and 1/3 are 0. With these extreme parameters, every response was predicted to be 1, and thus test error rates in the range of 50%-80% (with some fluctuation due to training data selection) kept occurring.
