I am trying to fit an exponentially modified Gaussian (as in equation (1) at https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution) to my 2D (x, y) data in R.
My data are:
x <- c(1.13669371604919, 1.14107275009155, 1.14545404911041, 1.14983117580414,
1.15421032905579, 1.15859162807465, 1.16296875476837, 1.16734790802002,
1.17172694206238, 1.17610621452332, 1.18048334121704, 1.18486452102661,
1.18924164772034, 1.19362080097198, 1.19800209999084, 1.20237922668457,
1.20675826072693, 1.21113955974579, 1.21551668643951, 1.21989583969116,
1.22427713871002, 1.22865414619446, 1.2330334186554, 1.23741245269775,
1.24178957939148, 1.24616885185242, 1.25055003166199, 1.25492715835571,
1.25930631160736, 1.26368761062622, 1.26806473731995, 1.2724437713623
)
y <- c(42384.03125, 65262.62890625, 235535.828125, 758616, 1691651.75,
3956937.25, 8939261, 20311304, 41061724, 65143896, 72517440,
96397368, 93956264, 87773568, 82922064, 67289832, 52540768, 50410896,
35995212, 27459486, 14173627, 12645145, 10069048, 4290783.5,
2999174.5, 2759047.5, 1610762.625, 1514802, 958150.6875, 593638.6875,
368925.8125, 172826.921875)
The function I am trying to fit and the value I am trying to minimize for optimization:
EMGCurve <- function(x, par)
{
  ta <- 1 / par[1]
  mu <- par[2]
  si <- par[3]
  h  <- par[4]
  Fct.V <- (h * si / ta) * (pi / 2)^0.5 * exp(0.5 * (si / ta)^2 - (x - mu) / ta)
  Fct.V
}
RMSE <- function(par)
{
  Fct.V <- EMGCurve(x, par)
  sqrt(sum((signal - Fct.V)^2) / length(signal))
}
result <- optim(c(1, x[which.max(y)], unname(quantile(x)[4]-quantile(x)[2]), max(y)),
lower = c(1, min(x), 0.0001, 0.1*max(y)),
upper = c(Inf, max(x), 0.5*(max(x) - min(x)), max(y)),
RMSE, method="L-BFGS-B", control=list(factr=1e7))
However, when I try to visualize the result at the end, it seems like nothing useful is happening:
plot(x,y,xlab="RT/min",ylab="I")
lines(seq(min(x),max(x),length=1000),GaussCurve(seq(min(x),max(x),length=1000),result$par),col=2)
For some reason it doesn't work at all, although I managed to do it for a normal distribution with similar code. It would be great if someone has an idea.
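For reference, equation (1) on the linked Wikipedia page also contains an erfc factor. A minimal sketch of that equation in R (my own transcription, not the poster's code; the amplitude parameter h is an assumption I add because the y values here are raw peak heights rather than a normalised density):
emg_ref <- function(x, lambda, mu, sigma, h = 1) {
  erfc <- function(z) 2 * pnorm(-sqrt(2) * z)   # erfc expressed via the normal CDF
  h * (lambda / 2) *
    exp((lambda / 2) * (2 * mu + lambda * sigma^2 - 2 * x)) *
    erfc((mu + lambda * sigma^2 - x) / (sqrt(2) * sigma))
}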
If it might be of some use, I got an OK fit to your data using an X-shifted log-normal type peak equation, "y = a * exp(-0.5 * pow((log(x-d)-b) / c, 2.0))", with parameters a = 9.4159743234392539E+07, b = -2.7516932481669185E+00, c = -2.4343893243720971E-01, and d = 1.1251623071481867E+00, yielding R-squared = 0.994 and RMSE = 2.49E06. I personally was unable to get a fit using the equation in your post. There may be value in scaling the dependent data, as the values are large, but this equation fits the data as is.
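To reproduce that fit, here is a minimal sketch in R (not the answerer's original code; it assumes the x and y vectors from the question and uses rounded versions of the quoted parameter values):
a <- 9.41597432e+07
b <- -2.75169325
cc <- -0.24343893   # named cc to avoid clashing with R's c()
d <- 1.12516231
shifted_lognormal <- function(x) a * exp(-0.5 * ((log(x - d) - b) / cc)^2)
plot(x, y, xlab = "RT/min", ylab = "I")
curve(shifted_lognormal, from = min(x), to = max(x), add = TRUE, col = 2)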
Related
I'm trying to estimate a model using MLE, and it has been requested that we be as explicit as possible with our method, which means avoiding canned commands like mle.
I have a problem with the integral that needs to be solved in order to estimate my model.
Here's my code:
LL <- function(params) {
  Sigma_e     <- params[1:4]
  W1          <- params[5:18]
  W2          <- params[19:22]
  W3          <- params[23:36]
  W4          <- params[37:60]
  Alpha_t     <- params[61:63]
  Sigma_theta <- params[64]
  Betas       <- params[65:70]
  Alpha       <- params[71:72]
  Sigma       <- params[73:74]
  Gamma       <- params[75:79]
  Alpha_I     <- params[80]
  integrand <- function(x) {
    ((dnorm(Y0 - as.matrix(X) %*% Betas[1:3] - x * Alpha[1], sd = Sigma[1])) * (1 - pnorm(as.matrix(Z) %*% Gamma + x * Alpha_I)))^(1 - D) *
      ((dnorm(Y1 - as.matrix(X) %*% Betas[4:6] - x * Alpha[2], sd = Sigma[2])) * (pnorm(as.matrix(Z) %*% Gamma + x * Alpha_I)))^(D) *
      (dnorm(T1 - as.matrix(W) %*% W1[1:14] - x, sd = Sigma_e[1])) *
      (dnorm(T2 - as.matrix(W) %*% W2[1:14] - x * Alpha_t[1], sd = Sigma_e[2])) *
      (dnorm(T3 - as.matrix(W) %*% W3[1:14] - x * Alpha_t[2], sd = Sigma_e[3])) *
      (dnorm(T4 - as.matrix(W) %*% W4[1:14] - x * Alpha_t[3], sd = Sigma_e[4])) *
      (dnorm(x, sd = Sigma_theta))
  }
  q <- integrate(integrand, -Inf, Inf)
  return(-sum(log(q)))
}
estimates <- optim(start_params, LL, method = "BFGS")
I'm pretty sure that the problem is with the integral, because the error message points to it. This might not be the optimal way to do it, but it's the solution I came up with.
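For what it's worth, a common stumbling block here is that integrate() needs to be applied one observation at a time and returns a list rather than a number. A simplified, hedged sketch of that pattern (toy data and made-up parameter names, not the model above):
set.seed(1)
y <- 1 + rnorm(50, sd = 0.5) + rnorm(50, sd = 1)   # toy outcome with a latent effect

negLL <- function(params) {
  mu          <- params[1]
  sigma       <- exp(params[2])   # log scale keeps the sd positive
  sigma_theta <- exp(params[3])
  ll_i <- function(i) {
    # the integrand is vectorised over the quadrature points x,
    # but refers to a single observation y[i]
    integrand <- function(x) dnorm(y[i] - mu - x, sd = sigma) * dnorm(x, sd = sigma_theta)
    integrate(integrand, -Inf, Inf)$value   # integrate() returns a list: take $value
  }
  -sum(log(vapply(seq_along(y), ll_i, numeric(1))))
}

negLL(c(0, 0, 0))   # a finite number, so optim() can work with this function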
I need to implement a logistic regression manually, using the Score/GMM approach, without the use of glm. This is because at later stages the model will be much more complicated. Currently I am running into a problem where, for the logistic regression, the optimization procedures depend heavily on the initial point. To illustrate, here is my code using an online dataset. More details about the procedure are in the comments:
library(data.table)
library(nleqslv)
library(Matrix)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
data_analysis<-data.table(mydata)
data_analysis[,constant:=1]
#Likelihood function for logit
#The logistic regression will regress the binary variable
#admit on a constant and the variable gpa
LL <- function(beta){
beta=as.numeric(beta)
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = h/(1+h)
llf <- sum(data_temp$admit * log(choice_prob)) + (sum((one-data_temp$admit) * log(one-choice_prob)))
return(-1*llf)
}
#Score to be used when optimizing using LL
#Identical to the Score function below but returns negative output
Score_LL <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = h/(1+h)
resid = as.numeric(data_temp$admit - choice_prob)
score_final2 = t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
return(-1*as.numeric(score_final2))
}
#The Score/Deriv/Jacobian of the Likelihood function
Score <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
choice_prob = as.numeric(h/(1+h))
resid = as.numeric(data_temp$admit - choice_prob)
score_final2 = t(mat_temp2) %*% Diagonal(length(resid), x=resid) %*% one
return(as.numeric(score_final2))
}
#Derivative of the Score function
Score_Deriv <- function(beta){
data_temp=data_analysis
mat_temp2 = cbind(data_temp$constant,
data_temp$gpa)
one = rep(1,dim(mat_temp2)[1])
h = exp(beta %*% t(mat_temp2))
weight = (h/(1+h)) * (1- (h/(1+h)))
weight_mat = Diagonal(length(weight), x=weight)
deriv = t(mat_temp2)%*%weight_mat%*%mat_temp2
return(-1*as.array(deriv))
}
#Quadratic Gain function
#Minimized at Score=0 and so minimizing is equivalent to solving the
#FOC of the Likelihood. This is the GMM approach.
Quad_Gain<- function(beta){
h=Score(as.numeric(beta))
return(sum(h*h))
}
#Derivative of the Quadratic Gain function
Quad_Gain_deriv <- function(beta){
return(2*t(Score_Deriv(beta))%*%Score(beta))
}
sol1=glm(admit ~ gpa, data = data_analysis, family = "binomial")
sol2=optim(c(2,2),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
sol3=optim(c(0,0),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
When I run this code, I get that sol3 matches what glm produces (sol1) but sol2, with a different initial point, differs from the glm solution by a lot. This is something happening in my main code with the actual data as well. One solution is to create a grid and test multiple starting points. However, my main data set has 10 parameters and this would make the grid very large and the program computationally infeasible. Is there a way around this problem?
Your code seems overly complicated. The following two functions define the negative log-likelihood and negative score vector for a logistic regression with the logit link:
logLik_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}
score_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- crossprod(X, y - plogis(eta))
}
Then you can use it as follows:
# load the data
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# fit with optim()
opt1 <- optim(c(-1, 1, -1), logLik_Bin, score_Bin, method = "BFGS",
y = mydata$admit, X = cbind(1, mydata$gre, mydata$gpa))
opt1$par
# compare with glm()
glm(admit ~ gre + gpa, data = mydata, family = binomial())
Typically, for well-behaved covariates (i.e., when you expect the coefficients to be in the interval [-4, 4]), starting at 0 is a good idea.
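On the original concern about needing a grid of starting points: rather than a full grid, a handful of random starts can be compared. A hedged sketch, reusing logLik_Bin, score_Bin and mydata from above on the original admit ~ gpa model:
set.seed(42)
X <- cbind(1, mydata$gpa)
y <- mydata$admit
starts <- matrix(runif(10 * 2, min = -1, max = 1), ncol = 2)
fits <- t(apply(starts, 1, function(s)
  optim(s, logLik_Bin, gr = score_Bin, method = "BFGS", y = y, X = X)$par))
fits  # each row should be close to coef(glm(admit ~ gpa, data = mydata, family = binomial()))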
Given:
set.seed(1001)
outcome<-rnorm(1000,sd = 1)
covariate<-rnorm(1000,sd = 1)
log-likelihood of normal pdf:
loglike <- function(par, outcome, covariate){
cov <- as.matrix(cbind(1, covariate))
xb <- cov * par
(- 1/2* sum((outcome - xb)^2))
}
optimize:
opt.normal <- optim(par = 0.1,fn = loglike,outcome=outcome,cov=covariate, method = "BFGS", control = list(fnscale = -1),hessian = TRUE)
However, I get different results when running a simple OLS. Maximizing the log-likelihood and minimizing the OLS criterion should give me similar estimates, so I suppose there is something wrong with my optimization.
summary(lm(outcome~covariate))
Umm several things... Here's a proper working likelihood function (with names x and y):
loglike <- function(par, x, y) {
  cov <- cbind(1, x)
  xb <- cov %*% par
  (-1/2) * sum((y - xb)^2)
}
Note the use of the matrix multiplication operator.
You were also running optim with only a single value in par, so not only was your loglike broken by the element-by-element multiplication, it was also estimating just one parameter.
Now compare optimiser parameters with lm coefficients:
opt.normal <- optim(par = c(0.1,0.1),fn = loglike,y=outcome,x=covariate, method = "BFGS", control = list(fnscale = -1),hessian = TRUE)
opt.normal$par
[1] 0.02148234 -0.09124299
summary(lm(outcome~covariate))$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02148235 0.03049535 0.7044466 0.481319029
covariate -0.09124299 0.03049819 -2.9917515 0.002842011
shazam.
Helpful hints: create data for which you know the right answer, e.g. x = 1:10; y = rnorm(10) + (1:10), so that you know the slope is 1 and the intercept 0. Then you can easily see whether your estimates are in the right ballpark. Also, run your loglike function on its own to see whether it behaves as you expect.
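A hedged sketch of that hint, using the corrected loglike() above (the data are simulated, so the exact numbers will differ):
set.seed(1)
x <- 1:10
y <- rnorm(10) + (1:10)            # known answer: slope about 1, intercept about 0
loglike(c(0, 1), x, y)             # sanity-check the function on its own
fit <- optim(par = c(0.1, 0.1), fn = loglike, x = x, y = y,
             method = "BFGS", control = list(fnscale = -1))
fit$par                            # should be roughly c(0, 1)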
Maybe you will find it useful to see the difference between these two methods in my code. I programmed it in the following way.
data.matrix <- as.matrix(hprice1[,c("assess","bdrms","lotsize","sqrft","colonial")])
loglik <- function(p, z){
  beta  <- p[1:5]
  sigma <- p[6]
  y     <- log(data.matrix[,1])
  eps   <- (y - beta[1] - z[,2:5] %*% beta[2:5])
  -nrow(z)*log(sigma) - 0.5*sum((eps/sigma)^2)
}
p0 <- c(5,0,0,0,0,2)
m <- optim(p0,loglik,method="BFGS",control=list(fnscale=-1,trace=10),hessian=TRUE,z=data.matrix)
rbind(m$par,sqrt(diag(solve(-m$hessian))))
And for the lm() method, I find this:
m.ols <- lm(log(assess)~bdrms+lotsize+sqrft+colonial,data=hprice1)
summary(m.ols)
Also, if you would like to estimate the elasticity of the assessed value with respect to lot size, or calculate a 95% confidence interval for this parameter, you could use the following:
elasticity.at.mean <- mean(hprice1$lotsize) * m$par[3]
var.coefficient <- solve(-m$hessian)[3,3]
var.elasticity <- mean(hprice1$lotsize)^2 * var.coefficient
# upper bound
elasticity.at.mean + qnorm(0.975)* sqrt(var.elasticity)
# lower bound
elasticity.at.mean + qnorm(0.025)* sqrt(var.elasticity)
A simpler example of the optim method is given below for a binomial distribution.
loglik1 <- function(p,n,n.f){
n.f*log(p) + (n-n.f)*log(1-p)
}
m <- optim(c(pi=0.5),loglik1,control=list(fnscale=-1),
n=73,n.f=18)
m
m <- optim(c(pi=0.5),loglik1,method="BFGS",hessian=TRUE,
control=list(fnscale=-1),n=73,n.f=18)
m
pi.hat <- m$par
Numerical calculation of the standard deviation:
rbind(pi.hat=pi.hat,sd.pi.hat=sqrt(diag(solve(-m$hessian))))
Analytical calculation:
rbind(pi.hat=18/73,sd.pi.hat=sqrt((pi.hat*(1-pi.hat))/73))
Or this code for the normal distribution.
loglik1 <- function(p,z){
mu <- p[1]
sigma <- p[2]
-(length(z)/2)*log(sigma^2) - sum(z^2)/(2*sigma^2) +
(mu*sum(z)/sigma^2) - (length(z)*mu^2)/(2*sigma^2)
}
m <- optim(c(mu=0,sigma2=0.1),loglik1,
control=list(fnscale=-1),z=aex)
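Since aex is the author's own data set and is not included here, a hedged check with simulated data (my own, purely illustrative) shows the optimiser recovering the sample mean and standard deviation:
set.seed(123)
z_sim <- rnorm(200, mean = 5, sd = 2)
m_sim <- optim(c(mu = 0, sigma = 1), loglik1, control = list(fnscale = -1), z = z_sim)
m_sim$par                 # mu should be near mean(z_sim), sigma near sd(z_sim)
c(mean(z_sim), sd(z_sim))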
I'm trying to fit two Gaussian peaks to my density-plot data, using the following code:
model <- function(coeffs,x)
{
(coeffs[1] * exp( - ((x-coeffs[2])/coeffs[3])**2 ))
}
y_axis <- data.matrix(den.PA$y)
x_axis <- data.matrix(den.PA$x)
peak1 <- c(1.12e-2,1075,2) # guess for peak 1
peak2 <- c(1.15e-2,1110,2) # guess for peak 2
peak1_fit <- model(peak1,den.PA$x)
peak2_fit <- model(peak2,den.PA$x)
total_peaks <- peak1_fit + peak2_fit
err <- den.PA$y - total_peaks
fit <- nls(y_axis~coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2 ),start=list(coeffs2=1.12e-2, coeffs3=1075, coeffs4=2))
fit2<- nls(y_axis~coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2 ),start=list(coeffs2=1.15e-2, coeffs3=1110, coeffs4=2))
fit_coeffs = coef(fit)
fit2_coeffs = coef(fit2)
a <- model(fit_coeffs,den.PA$x)
b <- model(fit2_coeffs,den.PA$x)
plot(den.PA, main="Cytochome C PA", xlab= expression(paste("Collision Cross-Section (", Å^2, ")")))
lines(results2,a, col="red")
lines(results2,b, col="blue")
This gives me a plot in which the two Gaussian peaks, fitted independently of each other, are simply overlaid on top of one another. That is my problem: I need to feed the err variable into nls so that it returns 6 coefficients, from which I can then re-model the Gaussian peaks to fit the plot.
The answer came to me as soon as I posted the question. Changing fit to this:
fit <- nls(y_axis~(coeffs2 * exp( - ((x_axis-coeffs3)/coeffs4)**2)) + (coeffs5 * exp( - ((x_axis-coeffs6)/coeffs7)**2)), start=list(coeffs2=1.12e-2, coeffs3=1075, coeffs4=2,coeffs5=1.15e-2, coeffs6=1110, coeffs7=2))
gives both peaks in a single fit. An inelegant solution, but it does the job.
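For completeness, a hedged, self-contained sketch of the same idea on simulated data (den.PA is not included in the post, so the data below are made up), showing how the six fitted coefficients split back into the two component peaks:
set.seed(1)
x_sim <- seq(1050, 1140, length.out = 400)
y_sim <- 1.1e-2 * exp(-((x_sim - 1075) / 4)^2) +
         1.2e-2 * exp(-((x_sim - 1110) / 5)^2) +
         rnorm(400, sd = 2e-4)
fit_both <- nls(y_sim ~ c2 * exp(-((x_sim - c3) / c4)^2) + c5 * exp(-((x_sim - c6) / c7)^2),
                start = list(c2 = 1e-2, c3 = 1076, c4 = 3, c5 = 1e-2, c6 = 1108, c7 = 4))
cf <- coef(fit_both)
plot(x_sim, y_sim, type = "l")
lines(x_sim, cf["c2"] * exp(-((x_sim - cf["c3"]) / cf["c4"])^2), col = "red")
lines(x_sim, cf["c5"] * exp(-((x_sim - cf["c6"]) / cf["c7"])^2), col = "blue")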
Assume A follows an exponential distribution and B follows a gamma distribution.
How can I plot the PDF of 0.5*(A+B)?
This is fairly straightforward using the "distr" package:
library(distr)
A <- Exp(rate=3)
B <- Gammad(shape=2, scale=3)
conv <- 0.5*(A+B)
plot(conv)
plot(conv, to.draw.arg=1)
Edit by JD Long: the resulting plot is not reproduced here.
If you're just looking for a fast graph, I usually take the quick-and-dirty simulation approach: do some draws, slam a (Gaussian) kernel density on the draws, and plot that bad boy:
numDraws <- 1e6
gammaDraws <- rgamma(numDraws, 2)
expDraws <- rexp(numDraws)
combined <- .5 * (gammaDraws + expDraws)
plot(density(combined))
Here is an attempt at doing the convolution (which @Jim Lewis refers to) in R. Note that there are probably much more efficient ways of doing this.
lower <- 0
upper <- 20
t <- seq(lower,upper,0.01)
fA <- dexp(t, rate = 0.4)
fB <- dgamma(t,shape = 8, rate = 2)
## C has the same distribution as (A + B)/2
dC <- function(x, lower, upper, exp.rate, gamma.rate, gamma.shape){
integrand <- function(Y, X, exp.rate, gamma.rate, gamma.shape){
dexp(Y, rate = exp.rate)*dgamma(2*X-Y, rate = gamma.rate, shape = gamma.shape)*2
}
out <- NULL
for(ix in seq_along(x)){
out[ix] <-
integrate(integrand, lower = lower, upper = upper,
X = x[ix], exp.rate = exp.rate,
gamma.rate = gamma.rate, gamma.shape = gamma.shape)$value
}
return(out)
}
fC <- dC(t, lower=lower, upper=upper, exp.rate=0.4, gamma.rate=2, gamma.shape=8)
## plot the resulting distribution
plot(t,fA,
ylim = range(fA,fB,na.rm=TRUE,finite = TRUE),
xlab = 'x',ylab = 'f(x)',type = 'l')
lines(t,fB,lty = 2)
lines(t,fC,lty = 3)
legend('topright', c('A ~ exp(0.4)','B ~ gamma(8,2)', 'C ~ (A+B)/2'),lty = 1:3)
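As a hedged sanity check (my addition, not part of the original answer), a simulation-based density estimate of (A + B)/2 with the same parameters should lie close to the fC curve drawn above:
set.seed(1)
sim <- 0.5 * (rexp(1e5, rate = 0.4) + rgamma(1e5, shape = 8, rate = 2))
lines(density(sim), col = "red")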
I'm not an R programmer, but it might be helpful to know that for independent random variables with PDFs f1(x) and f2(x), the PDF
of the sum of the two variables is given by the convolution f1 * f2 (x) of the two input PDFs.
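In symbols, and matching the factor of 2 in the integrand of dC() above, for C = (A + B)/2:
f_{A+B}(x) = (f_1 * f_2)(x) = \int_{-\infty}^{\infty} f_1(y)\, f_2(x - y)\, dy,
\qquad
f_C(x) = 2\, f_{A+B}(2x) = 2 \int_{-\infty}^{\infty} f_1(y)\, f_2(2x - y)\, dy.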