R gmm package using exactly identified moment conditions

For exactly identified moments, GMM results should be the same regardless of initial starting values. This doesn't appear to be the case, however.
library(gmm)
data(Finance)
x <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"])
# want to solve for coefficients theta[1], theta[2] in exactly identified
# system
g <- function(theta, x) {
  m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"]
  m.2 <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"]
  f <- cbind(m.1, m.2)
  return(f)
}
# gmm coefficient result should be identical to ols regressing rm on rf
# since two moments are E[u]=0 and E[u*rf]=0
model.lm <- lm(rm ~ rf, data=x)
model.lm
# gmm is consistent with lm given correct starting values
summary(gmm(g, x, t0=model.lm$coefficients))
# problem is that using different starting values leads to different
# coefficients
summary(gmm(g, x, t0=rep(0,2)))
Is there something wrong with my setup?

The gmm package author Pierre Chausse was kind enough to respond to my inquiry.
For linear models, he suggests using the formula approach:
gmm(rm ~ rf, ~rf, data=x)
For non-linear models, he emphasizes that the starting values are indeed critical. For exactly identified models, he suggests setting fnscale to a small number to force the optim minimizer to converge closer to 0. He also finds that the BFGS algorithm works better with GMM.
summary(gmm(g, x, t0=rep(0,2), method = "BFGS", control=list(fnscale=1e-8)))
Both solutions work for this example. Thanks Pierre!
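As a quick sanity check (a sketch; coef() should work on fitted gmm objects just as it does on lm objects), all three fits should now recover roughly the same coefficients:
coef(model.lm)
coef(gmm(rm ~ rf, ~rf, data = x))
coef(gmm(g, x, t0 = rep(0, 2), method = "BFGS", control = list(fnscale = 1e-8)))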

Related

How do I perform non-linear least squares in R with a predetermined lag structure

Suppose I want to estimate the parameters of the following model:
$y_t = \beta_0 \sum_{i=1}^{p} w(\delta; i)\, x_{t-i}$
LaTeX version of the equation: https://i.stack.imgur.com/POOlD.png
where $y_t$ and $x_{t-i}$ are known data points, $w(\delta; i)$ follows an exponential Almon lag structure with two parameters $\delta_1$ and $\delta_2$ (see image), and $\beta_0$ is the common parameter.
Generating some data for x and y:
y <- 1:10
x <- rnorm(10, 2, 5)
The literature suggests estimating the model parameters using NLS and the Gauss-Newton method. R does have a gaussNewton function, but I am not sure how to use it. How do I approach the estimation of the parameters beta0, delta1 and delta2?
Wikipedia suggests https://en.wikipedia.org/wiki/Non-linear_least_squares, but I feel like this is not appropriate in this case.
The nls function in R cannot deal with predefined lag structures, so this is not an option either. Maybe I could write out the function as a sum of squared residuals and use the optim function? Another option could be the nlm function. For example:
# sum of squared residuals for theta = c(beta0, delta1, delta2); weights assume
# the exponential Almon form exp(delta1*i + delta2*i^2), normalized to sum to 1
ssr <- function(theta, y, x, p.lag) {
  beta0 <- theta[1]; delta1 <- theta[2]; delta2 <- theta[3]
  i <- 1:p.lag
  w <- exp(delta1 * i + delta2 * i^2); w <- w / sum(w)
  t.idx <- (p.lag + 1):length(y)
  fitted <- sapply(t.idx, function(t) beta0 * sum(w * x[t - i]))
  sum((y[t.idx] - fitted)^2)
}
optim(par = c(1, 0, 0), fn = ssr, y = y, x = x, p.lag = 3)
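nlm minimizes the same objective (note its second argument is also named p, hence the name p.lag above):
# equivalent minimization with nlm, same (arbitrary) starting values
nlm(function(theta) ssr(theta, y = y, x = x, p.lag = 3), p = c(1, 0, 0))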
I look forward to your suggestions.

Fitting experimental data points to different cumulative distributions using R

I am new to programming and to R, so I would really appreciate your feedback on the problem I am trying to solve.
I have to fit a cumulative distribution with some function (a two- or three-parameter function). This seems to be a pretty straightforward task, but I've been buzzing around it for some time now.
Let me show you my variables:
x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196)
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999)
This is the plot with the x-axis on a log scale:
After some research, I tried a sigmoid function, as found in one of the posts (I can't add a link since my reputation is not high enough). This is the code:
# sigmoid function definition
sigmoid <- function(params, x) {
  params[1] / (1 + exp(-params[2] * (x - params[3])))
}
# fitting code using nonlinear least square
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# get the coefficients using the coef function
params=coef(fitmodel)
# asigning to y2 sigmoid function
y2 <- sigmoid(params,x)
# plotting y2 function
plot(y2,type="l")
# plotting data points
points(y)
This led me to reasonably good fitting results (though I don't know how to quantify this). But when I look at the plot of the fitted sigmoid function, I don't understand why the S shape now happens in the range of x-values from 40 until 7 (looking at the data, the S shape should be in x-values from 10 until 200).
Since I couldn't explain this behavior, I thought of trying the Weibull equation for the fit, but so far I can't get the code running.
To sum up:
Do you have any idea why the sigmoid gives me that weird fit?
Do you know a better two- or three-parameter equation for this fitting approach?
How could I determine the goodness of fit? Something like R^2?
# Data
df <- data.frame(x, y)  # the same x and y vectors as above
# sigmoid function definition
sigmoid <- function(x, a, b, c) {
  a * exp(-b * exp(-c * x))
}
# fitting code using nonlinear least square
fitmodel <- nls(y ~ sigmoid(x, a, b, c), start=list(a=1,b=.5,c=-2), data = df)
# plotting y2 function
plot(df$x, predict(fitmodel),type="l", log = "x")
# plotting data points
points(df)
The function I used is the Gompertz function and this blog post explains why R² shouldn't be used with nonlinear fits and offers an alternative.
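One commonly used alternative is the residual standard error, which can be read off the nls fit directly (a minimal sketch):
# residual standard error of the fit (smaller is better)
sigma(fitmodel)
# equivalently:
sqrt(sum(residuals(fitmodel)^2) / df.residual(fitmodel))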
After going through different functions and different data-sets I have found the best solution that gives the answers to all of my questions posted.
The code is as it follows for the data-set stated in question:
df <- data.frame(x, y)  # the same data as in the question
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3()) #The Gompertz model G.3()
plot(fm)
#Gompertz Coefficients and residual standard error
summary(fm)
The plot after fitting

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation, so you can define your function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics.
Basically we fit a linear regression of y over x and compute the ratio of the regression sum of squares to the total sum of squares.
Lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
Lemma 2: beta = cov(x, y) / var(x)
Lemma 3: R.square = cor(x, y) ^ 2
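A quick numeric check of Lemma 3 on the example data from the question:
obs <- 1:5
mod <- c(0.8, 2.4, 2, 3, 4.8)
cor(obs, mod)^2                    # [1] 0.8560185
summary(lm(mod ~ obs))$r.squared   # the same value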
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
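For example, a one-line RMSE helper:
# root mean squared error; unlike R squared, it is not shift-invariant
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))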
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is very extreme example:
preds <- 1:4/4
actual <- 1:4
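The R squared between those two vectors is 1, as a quick check confirms:
cor(preds, actual)^2
# [1] 1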
Yes, of course: one is just a linear rescaling of the other, so they have a perfect linear relationship. But do you really think preds is a good prediction of actual?
In reply to wordsforthewise
Thanks for your comments 1, 2 and your detailed answer.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x, then compute the regression sum of squares and the total sum of squares. It looks like you skip this regression step and go straight to the sum-of-squares computation. That is false, since then the partition of the sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1, as mentioned above in the Warning section):
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I never expected that this answer would eventually grow so long when I posted it 2 years ago. However, given the high view count of this thread, I feel obliged to add more statistical details and discussion. I don't want to mislead people into thinking that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and they have to be accessed from the returned vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this uses the squared-correlation form of R squared, as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1 - SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
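A sketch of its use on the vectors above (I believe R2() takes a formula argument that switches between the two definitions discussed in this thread):
# "corr" (the default) gives the squared correlation, "traditional" gives 1 - RSS/TSS
caret::R2(vect1, vect2, formula = "corr")
caret::R2(vect1, vect2, formula = "traditional")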
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language made primarily for statistics? I think I must be missing the implementation of R^2 somewhere, or nobody cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution, based on https://en.wikipedia.org/wiki/Coefficient_of_determination:
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
  y_actual = 1:5,
  y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise's; I just turned it into a function for convenience, since I use it a lot.
r2_general <- function(preds, actual) {
  1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
}
I use the function MLmetrics::R2_Score from the MLmetrics package; to compute R2 it uses the vanilla 1 - RSS/TSS formula.
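For example, on the data from the question (note that this gives the 1 - RSS/TSS value, not the squared correlation):
library(MLmetrics)
R2_Score(y_pred = c(0.8, 2.4, 2, 3, 4.8), y_true = 1:5)
# [1] 0.776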

How to obtain new samples from ZIP or ZINB-model for bayesian p-value

Hopefully someone can help me with this one, because I am really stuck and do not find my coding error!
I am fitting zero-inflated Poisson / negative binomial GLMs (no random effects) in JAGS (with R2jags), and everything is fine with the parameter estimates, priors, initial values and chain convergence. All results are perfectly in line with, e.g., the estimates from the pscl package, including my calculation of Pearson residuals in the model.
The only thing I cannot get to work is drawing a new sample from the model to obtain a Bayesian p-value for evaluating model fit. The "normal" Poisson and negative binomial models I fitted before all gave the expected replicated samples, and no problems occurred.
Here's my code so far; the important part is "# New Samples":
model{
  # 1. Priors
  beta ~ dmnorm(b0[], B0[,])
  aB ~ dnorm(0.001, 1)
  # 2. Likelihood function
  for (i in 1:N){
    # Logistic part
    W[i] ~ dbern(psi.min1[i])
    psi.min1[i] <- 1 - psi[i]
    eta.psi[i] <- aB
    logit(psi[i]) <- eta.psi[i]
    # Poisson part
    Y[i] ~ dpois(mu.eff[i])
    mu.eff[i] <- W[i] * mu[i]
    log(mu[i]) <- max(-20, min(20, eta.mu[i]))
    eta.mu[i] <- inprod(beta[], X[i,])
    # Discrepancy measures:
    ExpY[i] <- mu[i] * (1 - psi[i])
    VarY[i] <- (1 - psi[i]) * (mu[i] + psi[i] * pow(mu[i], 2))
    PRes[i] <- (Y[i] - ExpY[i]) / sqrt(VarY[i])
    D[i] <- pow(PRes[i], 2)
    # New Samples:
    YNew[i] ~ dpois(mu.eff[i])
    PResNew[i] <- (YNew[i] - ExpY[i]) / sqrt(VarY[i])
    DNew[i] <- pow(PResNew[i], 2)
  }
  Fit <- sum(D[1:N])
  FitNew <- sum(DNew[1:N])
}
The big problem is that I really tried all combinations and alterations I think could/should work, but when I look at the simulated samples, I get this:
> all.equal( Jags1$BUGSoutput$sims.list$YNew, Jags1$BUGSoutput$sims.list$Y )
[1] TRUE
And, to make it really weird, the means of Fit and FitNew nevertheless differ:
> Jags1$BUGSoutput$mean$Fit
[1] 109.7883
> Jags1$BUGSoutput$mean$FitNew
[1] 119.2111
Has anyone a clue what I am doing wrong? Any help would be deeply appreciated!
Kind regards, Ulf
I suspect this isn't the case, but the only obvious reason I can see for Y[i] and YNew[i] always being identical is that mu.eff[i] is ~zero, either because W[i] is 0 or because mu[i] is close to zero. That would imply that Y[] is always zero, which is easy to check from your data, but as I said, it does seem odd that you would be trying to model this. Otherwise I'm not sure what is going on; try simplifying the code to see if that solves the problem, and then add things back in until it breaks again. Some other suggestions:
It may be helpful for debugging to look at the absolute values of Y and YNew rather than just Y==YNew
If you want a negative binomial (= gamma-Poisson), try sampling mu[i] from a gamma distribution - I have used this formulation for ZINB models extensively, so am sure it works (a sketch follows after this list)
Your prior for aB looks odd to me - it gives a prior 95% CI for zero inflation around 12-88% - is that what you intended? And why a mean of 0.001 not 0? If you have no predictors then a beta prior for psi.min seems more natural - and if you have no useful prior information a beta(1,1) prior would be an obvious choice.
Minor point but you are calculating a lot of deterministic functions of aB within the for loop - this is going to slow down your model...
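A minimal sketch of that gamma-Poisson formulation (disp is an assumed name for the negative binomial dispersion parameter, and its prior below is just one common vague choice):
# replace the pure Poisson part of the likelihood loop with:
for (i in 1:N){
  lambda[i] ~ dgamma(disp, disp / mu[i])  # E[lambda[i]] = mu[i], Var = mu[i]^2 / disp
  mu.eff[i] <- W[i] * lambda[i]
  Y[i] ~ dpois(mu.eff[i])
}
disp ~ dgamma(0.01, 0.01)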
Hope that helps,
Matt
So, after nearly having a nervous breakdown, typing everything again and again while searching for my coding error, I found the most stupid error I have ever made - so far:
I just did not specify "Y" as a parameter to save, only "YNew". So when I compared YNew and Y from the sims.list with all.equal, I did not get what I thought I was getting. I do not know why JAGS gives me a Y at all (from the sims.list of the JAGS object), but for some reason it just returns YNew when asked for Y. So this part is actually right:
Jags1$BUGSoutput$mean$Fit
[1] 109.7883
Jags1$BUGSoutput$mean$FitNew
[1] 119.2111
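For reference, the fix is simply to monitor Y as well; a hypothetical R2jags call (data, inits and file name are placeholders) would look like:
library(R2jags)
Jags1 <- jags(data = jags.data, inits = inits,
              parameters.to.save = c("beta", "aB", "Y", "YNew", "Fit", "FitNew"),
              model.file = "zip_model.txt", n.chains = 3, n.iter = 10000)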
So I hope that I did not cause a major confusion for anybody...

Mixture of Gaussian and Gamma distribution

I'm looking for a script/package in R (Python would do too) to find the component distribution parameters of a mixture of Gaussian and Gamma distributions. So far I've used the R package mixtools to model the data as a mixture of Gaussians, but I think it can be better modeled by a Gamma plus a Gaussian.
Thanks
Here's one possibility:
Define utility functions:
rnormgammamix <- function(n, shape, rate, mean, sd, prob) {
  ifelse(runif(n) < prob,
         rgamma(n, shape, rate),
         rnorm(n, mean, sd))
}
(This could be made a little bit more efficient ...)
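For example, drawing the component labels first, so that each random number generator is called only for the draws that need it (a sketch):
rnormgammamix2 <- function(n, shape, rate, mean, sd, prob) {
  from.gamma <- runif(n) < prob
  out <- numeric(n)
  out[from.gamma] <- rgamma(sum(from.gamma), shape, rate)
  out[!from.gamma] <- rnorm(sum(!from.gamma), mean, sd)
  out
}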
dnormgammamix <- function(x, shape, rate, mean, sd, prob, log = FALSE) {
  r <- prob * dgamma(x, shape, rate) + (1 - prob) * dnorm(x, mean, sd)
  if (log) log(r) else r
}
Generate fake data:
set.seed(101)
r <- rnormgammamix(1000,1.5,2,3,2,0.5)
d <- data.frame(r)
Approach #1: the bbmle package. Fit shape, rate and standard deviation on the log scale, and prob on the logit scale:
library("bbmle")
m1 <- mle2(r ~ dnormgammamix(exp(logshape), exp(lograte), mean, exp(logsd),
                             plogis(logitprob)),
           data = d,
           start = list(logshape = 0, lograte = 0, mean = 0, logsd = 0,
                        logitprob = 0))
cc <- coef(m1)
png("normgam.png")
par(bty="l",las=1)
hist(r,breaks=100,col="gray",freq=FALSE)
rvec <- seq(-2,8,length=101)
pred <- with(as.list(cc),
             dnormgammamix(rvec, exp(logshape), exp(lograte), mean,
                           exp(logsd), plogis(logitprob)))
lines(rvec,pred,col=2,lwd=2)
true <- dnormgammamix(rvec,1.5,2,3,2,0.5)
lines(rvec,true,col=4,lwd=2)
dev.off()
tcc <- with(as.list(cc),
            c(shape = exp(logshape),
              rate = exp(lograte),
              mean = mean,
              sd = exp(logsd),
              prob = plogis(logitprob)))
cbind(tcc,c(1.5,2,3,2,0.5))
The fit is reasonable, but the parameters are fairly far off -- I think this model isn't very strongly identifiable in this parameter regime (i.e., the Gamma and Gaussian components can be swapped).
Approach #2: MASS::fitdistr.
library("MASS")
ff <- fitdistr(r, dnormgammamix,
               start = list(shape = 1, rate = 1, mean = 0, sd = 1, prob = 0.5))
cbind(tcc,ff$estimate,c(1.5,2,3,2,0.5))
fitdistr gets the same result as mle2, which suggests we're in a local minimum. If we start from the true parameters, we get to something reasonable and near the true parameters:
ff2 <- fitdistr(r, dnormgammamix,
                start = list(shape = 1.5, rate = 2, mean = 3, sd = 2, prob = 0.5))
-logLik(ff2) ## 1725.994
-logLik(ff) ## 1755.458
