Writing a log-likelihood as a function in R (what is theta?)

I have the following log-likelihood from my model, which I am trying to write as a function in R.
My issue is that I don't know how to write theta in terms of the function. I have had a couple of attempts at this, shown below; any tips or advice on whether these are close to being correct would be appreciated.
First attempt, with theta written as theta:
#my likelihood function
mylikelihood = function(beta) {
  #log-likelihood
  result = sum(log(dengue$cases + theta + 1 / dengue$cases)) +
    sum(theta*log(theta / theta + exp(beta[1]+beta[2]*dengue$time))) +
    sum(theta * log(exp(beta[1]+beta[2]*dengue$time / dengue$cases + exp(beta[1]+beta[2]*dengue$time))))
  #return negative log-likelihood
  return(-result)
}
My next attempt, with theta replaced by the Xi from my dataset (here, dengue$Time):
#my likelihood function attempt 2
mylikelihood = function(beta) {
  #log-likelihood
  result = sum((log(dengue$Cases + dengue$Time + 1 / dengue$Cases))) +
    sum(dengue$Time*log(dengue$time / dengue$Time + exp(beta[1]+beta[2]*dengue$Time))) +
    sum(dengue$Cases * log(exp(beta[1]+beta[2]*dengue$Time / dengue$Cases +
      exp(beta[1]+beta[2]*dengue$Time))))
  #return negative log-likelihood
  return(-result)
}
data
head(dengue)
Cases Week Time
1 148 36 1
2 275 37 2
3 205 38 3
4 133 39 4
5 123 40 5
6 138 41 6
Is either of these close to being correct, and if not, where am I going wrong?
Update with information about where the log-likelihood comes from.
The model:
A Negative Binomial distribution with mean µ and dispersion parameter θ has pmf

$$P(Y_i = y_i) = \binom{y_i + \theta - 1}{y_i}\left(\frac{\theta}{\theta + \mu_i}\right)^{\theta}\left(\frac{\mu_i}{\theta + \mu_i}\right)^{y_i}, \qquad \mu_i = \exp(\beta_0 + \beta_1 x_i)$$
The fundamental problem is that you have to pass both beta (the intercept and slope of the linear predictor) and theta as part of a single parameter vector. You had other problems with parenthesis placement that I fixed, and I reorganized the expressions a little bit.
There are a couple of other important mistakes in your code.
The first term is not a fraction; it is a binomial coefficient (i.e., you should use lchoose(), as shown below).
The +1 in your first term should be a -1 (the binomial coefficient involves yi + theta - 1).
nll <- function(pars) {
  beta <- pars[1:2]
  theta <- pars[3]
  ## log-likelihood
  yi <- dengue$Cases
  xi <- dengue$Time
  ri <- exp(beta[1] + beta[2] * xi)
  result <- sum(lchoose(yi + theta - 1, yi)) +
    sum(theta * log(theta / (theta + ri))) +
    sum(yi * log(ri / (theta + ri)))
  ## return negative log-likelihood
  return(-result)
}
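As an aside, the three terms above are just the expanded log of the negative binomial pmf, so the same negative log-likelihood can be written more compactly with R's built-in dnbinom(). This is a sketch, not part of the original answer; it assumes the same dengue data frame as below.
nll2 <- function(pars) {
  beta <- pars[1:2]
  theta <- pars[3]
  mu <- exp(beta[1] + beta[2] * dengue$Time)
  ## dnbinom() with size = theta and mu = mu is the same pmf as above
  -sum(dnbinom(dengue$Cases, size = theta, mu = mu, log = TRUE))
}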
read data
dengue <- read.table(row.names = 1, header = TRUE, text = "
Cases Week Time
1 148 36 1
2 275 37 2
3 205 38 3
4 133 39 4
5 123 40 5
6 138 41 6
")
fitting
Guessing starting parameters of (1,1,1) is a bit dangerous - it would make more sense to know something about the meaning of the parameters and guess biologically plausible values - but it seems to be OK.
nll(c(1,1,1))
optim(par = c(1,1,1), nll)
Since we didn't constrain theta to be positive, we get some warnings about taking the log of a negative number, but these are probably harmless.
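One way to avoid those warnings (a sketch, not the only option) is to optimize over log(theta) instead, so that theta can never go negative:
## third parameter is now log(theta)
nll_logtheta <- function(pars) nll(c(pars[1:2], exp(pars[3])))
optim(par = c(1, 1, 0), nll_logtheta)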
alternatives
R has a lot of built-in machinery for fitting negative binomial models (I should have recognized what you were doing!)
MASS::glm.nb sets everything up for you automatically; you just have to specify the predictor variables (it uses a log link and adds an intercept, so specifying ~Time will make the mean equal to exp(beta0 + beta1*Time)).
library(MASS)
glm.nb(Cases ~ Time, data = dengue)
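The fitted object reports the coefficients on the log-link scale along with the estimated theta; a short usage sketch (the object name fit is mine):
fit <- glm.nb(Cases ~ Time, data = dengue)
coef(fit)   # intercept and slope on the log scale
fit$theta   # estimated dispersion parameter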
bbmle is a little bit less automated, but more flexible (here I am fitting theta on the log scale to avoid trying any negative values)
library(bbmle)
mle2(Cases ~ dnbinom(mu = exp(logmu), size = exp(logtheta)),
parameters = list(logmu ~ Time),
data = dengue,
start = list(logmu = 0, logtheta = 0))
All three of these approaches (corrected negative log-likelihood function + optim, MASS::glm.nb, bbmle::mle2) give the same results.

Related

Why is Adam optimization unable to converge in linear regression?

I am studying the Adam optimizer. This is a toy problem. In R, I generate some artificial data:
Y = c0 + c1 * x1 + c2 * x2 + noise
In the above equation, x1, x2, and noise are normal random numbers generated in R, and theta = [c0, c1, c2] is the parameter vector I try to estimate with the Adam optimizer. For this simple regression problem, I can use the analytical (least-squares) method to determine theta, which is k in my R code below.
For the Adam algorithm, I use the formulae from this site:
Overview: Adam
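For reference, the update rules implemented in the code below are the standard Adam updates (summarized here for readability; g_t is the gradient at step t):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$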
I vary the step size eta in this parametric study. The final theta from the Adam algorithm is not the same as the analytical solution k in my R code.
I have checked my code many times and run it line by line, and I cannot understand why the Adam algorithm does not converge.
Added:
I changed the algorithm to AMSGrad. It performs better than Adam in this case, but it still does not converge.
rm(list = ls())
n=500
x1=rnorm(n,mean=6,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(intercept=rep(1,n),x1,x2))
Y=-20+51*x1-15*x2+rnorm(n,mean=0,sd=2);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=4000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.04 # step size
beta1=0.9
beta2=0.999
t1=integer(iterations)
t2=matrix(0,iterations,3)
t3=integer(iterations)
epsilon1=1E-8 # small number defined for numerical computation
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(intercept=rep(1,length(N)),X))# add a column of ones to represent intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=2),1,np) # Initialize theta:1 x np matrix
m_i=matrix(0,1,np) # initialization, zero vector
v_i=matrix(0,1,np) # initialization, zero vector
for(i in 1:iterations){
  error=theta%*%t(X)-t(Y) # error = (theta * x' - Y'). Error is a 1 x N row vector
  grad=1/N*error%*%X # gradient grad is a 1 x np vector
  m_i=beta1*m_i+(1-beta1)*grad # moving average of gradients, 1 x np vector
  v_i=beta2*v_i+(1-beta2)*grad^2 # moving average of squared gradients, 1 x np vector
  # corrected moving averages
  m_corrected=m_i/(1-beta1^i)
  v_corrected=v_i/(1-beta2^i)
  d_theta=eta/(sqrt(v_corrected)+epsilon1)*m_corrected
  theta=theta-d_theta
  L=sqrt(sum((d_theta)^2)) # calculating the L2 norm
  t1[i]=L # record the L2 norm in each step
  if ((is.infinite(L))||(is.nan(L))) {
    print("Learning rate is too large. Lowering the rate may help.")
    break
  } else if (L<=epsilon) {
    print("Algorithm convergence is reached.")
    break # checking whether convergence is obtained or not
  }
  # if (i==1){
  #   browser()
  # }
}
plot(t1,type="l",ylab="norm",lwd=3,col=rgb(0,0,1))
k
theta
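As a side note (a sketch added here, not part of the original question): the closed-form solution k above is just ordinary least squares, so it can also be sanity-checked against lm().
coef(lm(Y ~ x1 + x2)) # should agree with k up to numerical precision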

How to simulate a dataset with a binary target in proportions determined 'a-priori'?

Can someone tell me the best way to simulate a dataset with a binary target?
I understand how a dataset can be simulated, but what I'm looking for is to determine the proportion of each class a priori. What I thought was to change the intercept to achieve it, but I couldn't make it work and I don't know why. I guess the average is playing a trick on me.
set.seed(666)
x1 = rnorm(1000)
x2 = rnorm(1000)
p=0.25 # <<< I'm looking for a 25%/75%
mean_z=log(p/(1-p))
b0 = mean( mean_z - (4*x1 + 3*x2)) # = mean_z - mean( 4*x1 + 3*x2)
z = b0 + 4*x1 + 3*x2 # = mean_z - (4*x1 + 3*x2) + (4*x1 + 3*x2) = rep(mean_z,1000)
mean( b0 + 4*x1 + 3*x2 ) == mean_z # TRUE!!
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
mean(pr) # ~ 40% << not achieved
table(y)/1000
What I'm looking for is to simulate the typical "logistic" problem in which the binary target can be modeled as a linear combination of features.
These 'logistic' models assume that the log-odds of the binary variable is linear in the features. That means:
log (p / (1-p)) = z = b0 + b1 * x1 + b2 * x2, where p = P(y = 1)
Going back to my sample code, we could use, for example, z = 1.3 + 4 * x1 + 2 * x2, but then the class probability would simply be whatever results. Or instead we could choose the coefficient b0 so that the probability is (statistically) similar to the one sought:
log (0.25 / 0.75) = b0 + 4 * x1 + 2 * x2
This is my approach, but there may be better ones.
I gather that you are considering a logistic regression model, right? If so, one way to generate a data set is to create two Gaussian bumps and say that one is class 1 and the other is class 0. Then generate 25 items from class 1 and 75 items from class 0. Then each generated item plus its label is a datum or record or whatever you want to call it.
Obviously you can choose any proportions of 1's and 0's. It is also interesting to make the problem "easy" by making the Gaussian bumps farther apart (i.e. variances smaller in comparison to difference of means) or "hard" by making the bumps overlapping (i.e. variances larger compared to difference of means).
EDIT: In order to make sample data which correspond exactly to a logistic regression model, just make the variances of the two Gaussian bumps the same. When the variances (by this I mean specifically the covariance matrix) are the same, the surfaces of equal posterior class probability are planes; when the covariances are different, the surfaces of equal probability are quadratics. This is a standard result which will appear in many textbooks. I also have some notes online about this, which I can locate if it will help.
Aside from generating the two classes separately and then merging the results into one set, you can also sample from a single distribution over x, plug x into a logistic regression model with some weights (which you choose by any means you wish), and then use the resulting output as a probability for a coin toss. This method isn't guaranteed to output proportions that correspond exactly to prior class probabilities.
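A minimal sketch of the two-Gaussian suggestion above (equal variances in both classes, class proportions fixed a priori at 25%/75%; the class means used here are illustrative):
set.seed(1)
n  <- 1000
y  <- rep(c(1, 0), times = c(0.25 * n, 0.75 * n)) # exact 25%/75% split
## equal variances in the two classes => the logistic regression model holds exactly
x1 <- rnorm(n, mean = ifelse(y == 1, 2, 0), sd = 1)
x2 <- rnorm(n, mean = ifelse(y == 1, 1, 0), sd = 1)
table(y) / n                        # 0.75 / 0.25 by construction
glm(y ~ x1 + x2, family = binomial) # recovers a linear (logistic) decision rule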

Accuracy is different between "tune" and "predict" in R

I am making an SVM which will differentiate between before and after track maintenance, using an on-board accelerometer on a train car. There is a focused section, and I extracted the acceleration data corresponding to that section. Each run takes around 3 minutes to pass that section, so given that the sampling rate of the accelerometer is around 1600/s, there are around 3 min * 60 s * 1600/s = 288,000 acceleration records for each run. I then calculate the variance, maximum, minimum, mean, standard deviation, and most frequent value of those acceleration records for each run. There are around 250 runs, so I built a dataset of those calculated values for the 250 runs. I also added the classification of before/after track maintenance, based on the maintenance record and the date of each run.
Using this, I tried to build the SVM as mentioned. First, I tried to find optimal values of gamma and cost for the Gaussian kernel, so I used "tune" to do a grid search. I got the following result:
> source("grid_search.R")
[gamma = 1 , cost = 10 ]
- best parameters:
gamma = 1.584893 ; cost = 25.11886 ;
accuracy: 88.54935 %
Also "grid_search.R" is as follow:
gamma <- 10^(0.0)
cost <- 10^(1.0)
gammaRange <- 10^seq(log10(gamma)-1,log10(gamma)+1,length=11)[2:10]
costRange <- 10^seq(log10(cost)-1 ,log10(cost)+1 ,length=11)[2:10]
t <- tune.svm(Category ~ ., data = X, gamma=gammaRange, cost=costRange,
              tunecontrol = tune.control(sampling="cross", cross=8))
cat("[gamma =", gamma, ", cost =" , cost , "]\n")
cat("- best parameters:\n")
cat("gamma =", t$best.parameters$gamma, "; cost =", t$best.parameters$cost, ";\n")
cat("accuracy:", 100 - t$best.performance * 100, "%\n\n")
plot(t, transform.x=log10, transform.y=log10, zlim=c(0,0.1))
After that, using "svm" with the parameters "gamma = 1.584893; cost = 25.11886" found above, I trained the SVM and then predicted the classes of the same data that was used for training, as follows:
gamma = 1.584893 ; cost = 25.11886;
model <- svm(Category ~ ., data = X, gamma=gamma, cost=cost)
pred <- predict(model, X)
table(pred, X[,13])
And I got the following confusion matrix:
pred after before
after 47 2
before 1 185
My question is: based on the matrix above, the accuracy can be computed as
1 - (1 + 2)/(47 + 2 + 1 + 185) = 0.987234 (98.7%)
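For reference, the same number can be computed directly from the confusion matrix in R; a small sketch, assuming (as in the table() call above) that column 13 of X is the class label:
tab <- table(pred, X[, 13])
sum(diag(tab)) / sum(tab) # resubstitution accuracy, about 0.987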
But I also got "accuracy: 88.54935 %" from "tune" when it found the optimal parameters "gamma = 1.584893; cost = 25.11886". Why are these two accuracy values so different?

Using antiD function for variance of gamma distribution

This is my first post here and I hope I'll follow all the rules of the community.
I'm trying to calculate the variance of a gamma distribution with shape parameter 2 and scale parameter 3 in R, using the function antiD from the mosaic package. The R code I use is the following:
stopifnot(require(mosaic))
f <- function(y) {
dgamma(y, shape = 2, scale = 3)
}
mean_integral <- antiD( z*f(z) ~ z )
mn <- mean_integral(10^4)
g <- function(y) {
(y - mn)^2
}
variance <- antiD(f(x)*g(x) ~ x)
variance(10^5)
## [1] 7.115334e-09
The problem is that the number I get doesn't make sense, as the variance of a gamma distribution with those parameters should equal 2*3^2 = 18 (see the Wikipedia page on the gamma distribution). Moreover, if I pass 10^4 as the upper bound (the default lower bound is 0) to variance(), it returns the following:
variance(10^4)
## [1] 18
And the integral from 10^4 to 10^5 will be:
variance(10^5) - variance(10^4)
## [1] -18
Does anyone know why variance(10^5) produces nonsensical results in this case? I would also be grateful for any additional comments on the style of the post.
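As a numeric cross-check (added here; it does not explain the antiD behaviour), base R's integrate() over the whole positive axis returns the expected value, assuming the same f and mn defined above:
integrate(function(x) f(x) * (x - mn)^2, lower = 0, upper = Inf)
## should be close to 18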

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector starting at 0 and ending at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
... and so on
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
If you have a finite number of possible values then you can use findInterval or cut, or better, sample, as mentioned by @Hemmo (a findInterval version is sketched below).
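A rough findInterval() sketch for the finite case, reusing the u and cdf from the question (it returns the index of the first cdf value that exceeds each u, which is exactly the algorithm described above):
draws <- findInterval(u, cdf) + 1 # +1 because findInterval counts cdf values <= u
head(table(draws))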
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.), then here is an algorithm that will work (it will also work with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution's probabilities, subtracting them from the vector of uniforms; the random value for each uniform is the iteration at which it goes negative. This is easier to see with an example. The following generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares it to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp))
while (any(tmp > 0)) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois(tmp2, 5)
all.equal(randvals, randvals2)
