How to write this function for a mixture model in R?

I want to estimate a model in R. One part of it is a finite mixture model consisting of two OLS regressions. As a newcomer to R, I don't know how to write this probability density function in R, and I would appreciate some help.
The probability density function is as follows, where φ is the standard normal density:
f(y|x) = (p/σ1)*φ((y - x*b1)/σ1) + ((1-p)/σ2)*φ((y - x*b2)/σ2)
I have used Stata to write an example:
gen double `f1' = normalden($ML_y1, `xb1', exp(`lns1'))
gen double `f2' = normalden($ML_y1, `xb2', exp(`lns2'))
tempvar p
gen double `p' = exp(`lp')/(1 + exp(`lp'))
replace `lnf' = ln(`p'*`f1' + (1-`p')*`f2')
I wonder if you can show me how to write this function in R.
Thanks a lot; I am looking forward to your help.

See the function FLXMRglm in the flexmix package; the density is evaluated with dnorm:
library(flexmix)
FLXMRglm
# your case: the Gaussian branch of the FLXMRglm source
if (family == "gaussian") {
    z@defineComponent <- function(para) {
        predict <- function(x, ...) {
            dotarg <- list(...)
            if ("offset" %in% names(dotarg))
                offset <- dotarg$offset
            p <- x %*% para$coef
            if (!is.null(offset))
                p <- p + offset
            p
        }
        logLik <- function(x, y, ...)
            dnorm(y, mean = predict(x, ...), sd = para$sigma, log = TRUE)
        new("FLXcomponent",
            parameters = list(coef = para$coef, sigma = para$sigma),
            logLik = logLik, predict = predict, df = para$df)
    }
    z@fit <- function(x, y, w, component) {
        fit <- lm.wfit(x, y, w = w, offset = offset)
        z@defineComponent(para = list(
            coef = coef(fit), df = ncol(x) + 1,
            sigma = sqrt(sum(fit$weights * fit$residuals^2 / mean(fit$weights)) /
                         (nrow(x) - fit$rank))))
    }
}
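To address the question directly: the mixture density itself is a one-liner with dnorm (dnorm(y, mean, sd) equals φ((y - mean)/sd)/sd, so the 1/σ factor is built in), and the whole two-component mixture of regressions can be fitted with flexmix(). A minimal sketch; the simulated data and the name mix_density are illustrative, not from the question:
library(flexmix)

# f(y|x) = p*N(y; X*b1, sigma1^2) + (1-p)*N(y; X*b2, sigma2^2)
mix_density <- function(y, X, p, b1, b2, sigma1, sigma2) {
    p       * dnorm(y, mean = X %*% b1, sd = sigma1) +
    (1 - p) * dnorm(y, mean = X %*% b2, sd = sigma2)
}

# simulated two-regime data, purely for illustration
set.seed(1)
n <- 200
x <- runif(n)
cls <- rbinom(n, 1, 0.4)
y <- ifelse(cls == 1, 1 + 2*x + rnorm(n, sd = 0.3),
                      4 - 1*x + rnorm(n, sd = 0.5))
d <- data.frame(x = x, y = y)

# two-component mixture of linear regressions, estimated by EM
m <- flexmix(y ~ x, data = d, k = 2, model = FLXMRglm(family = "gaussian"))
parameters(m)  # coef (intercept, slope) and sigma for each component
prior(m)       # the estimated mixing proportion p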

Related

Fitting a probit model in R

For my thesis I have to fit some GLM-type models by maximum likelihood that R doesn't provide. I was doing fine for the models with a closed form, but now I have to use the Gaussian CDF, so I decided to start by fitting a simple probit model.
This is the code:
Data:
set.seed(123)
x <- matrix(rnorm(50, 2, 4), 50, 1)
m <- matrix(runif(50, 2, 4), 50, 1)
t <- matrix(rpois(50, 0.5), 50, 1)
z <- (1 + exp(-((x - mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y, x, m, t))
myprobit <- function(formula, data) {
    mf <- model.frame(formula, data)
    y <- model.response(mf, "numeric")
    X <- model.matrix(formula, data = data)
    if (any(is.na(cbind(y, X))))
        stop("Some data are missing.")
    loglik <- function(betas, X, y, sigma) {  # log-likelihood
        p <- length(betas)
        beta <- betas[-p]
        eta <- X %*% beta
        sigma <- 1  # for identification, sigma must equal 1
        G <- pnorm(y, mean = eta, sd = sigma)
        sum(y * log(G) + (1 - y) * log(1 - G))
    }
    ls.reg <- lm(y ~ X - 1)  # starting values from OLS; X already contains a constant
    start <- coef(ls.reg)
    fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1),
                 method = "BFGS", hessian = TRUE)  # optimize
    if (fit$convergence > 0) {
        print(fit)
        stop("optim failed to converge!")  # verify convergence
    }
    return(fit)
}
myprobit(y ~ x + m + t, data = data1)
And I get: Error in X %*% beta : non-conformable arguments. If I change start <- coef(ls.reg) to start <- c(coef(ls.reg), 1), I get wrong estimates compared with:
probit <- glm(y ~ x + m + t, data = data1, family = binomial(link = "probit"))
What am I doing wrong?
Is it possible to fit this model correctly using pnorm? If not, what algorithm should I use to approximate the Gaussian CDF? Thanks!!
The line of code responsible for your error is the following:
eta <- X %*% beta
Note that %*% is the matrix multiplication operator. Reproducing your code, X is a matrix with 50 rows and 4 columns, so for the multiplication to be defined, beta needs 4 elements. But betas[-p] removes the last element of the betas vector, leaving only three of the four you need. If you remove [-p], the code will run.
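As a side note, even with conformable arguments the likelihood above evaluates pnorm at y rather than at the linear predictor. A minimal self-contained sketch of a standard probit negative log-likelihood (probit_nll is an illustrative name; it reuses x, m, t, y, and data1 from the question):
# probit model: P(y = 1 | x) = pnorm(eta), with eta = X %*% beta
probit_nll <- function(beta, X, y) {
    eta <- c(X %*% beta)
    # log.p = TRUE is numerically safer than log(pnorm(eta))
    -sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}

X <- cbind(1, x, m, t)  # design matrix with an intercept
fit <- optim(rep(0, ncol(X)), probit_nll, X = X, y = y,
             method = "BFGS", hessian = TRUE)
fit$par
# compare with:
coef(glm(y ~ x + m + t, data = data1, family = binomial(link = "probit")))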

Negative Binomial regression manually

I want to do a negative binomial regression manually and define a function that can estimate an arbitrary number of coefficients. How can I get a matrix of betas and p-values without having to write out every beta explicitly?
For Poisson regression I was successful, as the code below shows:
daten <- warpbreaks
LogLike <- function(y, x, par) {
    beta <- par
    lambda <- exp(beta %*% t(x))
    LL <- -sum(dpois(y, lambda, log = TRUE))
    return(LL)
}
PoisMod <- function(formula, data) {
    form <- formula(formula)
    model <- model.frame(formula, data = data)
    x <- model.matrix(formula, data = data)
    y <- model.response(model)
    par <- rep(0, ncol(x))
    erg <- list(optim(par = par, fn = LogLike, x = x, y = y)$par)
    return(erg)
}
PoisMod(breaks ~ wool + tension, as.data.frame(daten))
glm(breaks ~ wool + tension, family = "poisson", data = as.data.frame(daten))
In the case of the negative binomial, how do I rewrite this part?
daten <- warpbreaks
LogLike <- function(y, x, par) {
    beta <- par
    lambda <- exp(beta %*% t(x))
    LL <- -sum(dpois(y, lambda, log = TRUE))
    return(LL)
}
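A minimal sketch of how that part might look for the negative binomial, using dnbinom's mu/size parametrization and estimating log(theta) as one extra element of par (LogLikeNB and NegBinMod are illustrative names; compare the result against MASS::glm.nb):
LogLikeNB <- function(y, x, par) {
    beta  <- par[-length(par)]        # regression coefficients
    theta <- exp(par[length(par)])    # dispersion, kept positive via exp()
    lambda <- exp(beta %*% t(x))      # mean, as in the Poisson version
    LL <- -sum(dnbinom(y, mu = lambda, size = theta, log = TRUE))
    return(LL)
}

NegBinMod <- function(formula, data) {
    x <- model.matrix(formula, data = data)
    y <- model.response(model.frame(formula, data = data))
    par <- c(rep(0, ncol(x)), 0)      # start betas at 0, log(theta) at 0
    optim(par = par, fn = LogLikeNB, x = x, y = y)$par
}

NegBinMod(breaks ~ wool + tension, as.data.frame(daten))
# compare: MASS::glm.nb(breaks ~ wool + tension, data = as.data.frame(daten))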

Logistic Regression in R: Optimization Issues concerning Initial Guess

I need to implement a logistic regression manually, using the score/GMM approach, without using glm(). This is because at later stages the model will be much more complicated. Currently I am running into a problem where, for the logistic regression, the optimization procedure is very dependent on the initial point. To illustrate, here is my code using an online dataset; more details about the procedure are in the comments:
library(data.table)
library(nleqslv)
library(Matrix)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
data_analysis<-data.table(mydata)
data_analysis[,constant:=1]
#Likelihood function for logit
#The logistic regression will regress the binary variable
#admit on a constant and the variable gpa
LL <- function(beta) {
    beta <- as.numeric(beta)
    data_temp <- data_analysis
    mat_temp2 <- cbind(data_temp$constant, data_temp$gpa)
    one <- rep(1, dim(mat_temp2)[1])
    h <- exp(beta %*% t(mat_temp2))
    choice_prob <- h/(1 + h)
    llf <- sum(data_temp$admit * log(choice_prob)) +
        sum((one - data_temp$admit) * log(one - choice_prob))
    return(-1 * llf)
}
#Score to be used when optimizing using LL
#Identical to the Score function below but returns negative output
Score_LL <- function(beta) {
    data_temp <- data_analysis
    mat_temp2 <- cbind(data_temp$constant, data_temp$gpa)
    one <- rep(1, dim(mat_temp2)[1])
    h <- exp(beta %*% t(mat_temp2))
    choice_prob <- h/(1 + h)
    resid <- as.numeric(data_temp$admit - choice_prob)
    score_final2 <- t(mat_temp2) %*% Diagonal(length(resid), x = resid) %*% one
    return(-1 * as.numeric(score_final2))
}
#The score/derivative/Jacobian of the likelihood function
Score <- function(beta) {
    data_temp <- data_analysis
    mat_temp2 <- cbind(data_temp$constant, data_temp$gpa)
    one <- rep(1, dim(mat_temp2)[1])
    h <- exp(beta %*% t(mat_temp2))
    choice_prob <- as.numeric(h/(1 + h))
    resid <- as.numeric(data_temp$admit - choice_prob)
    score_final2 <- t(mat_temp2) %*% Diagonal(length(resid), x = resid) %*% one
    return(as.numeric(score_final2))
}
#Derivative of the score function
Score_Deriv <- function(beta) {
    data_temp <- data_analysis
    mat_temp2 <- cbind(data_temp$constant, data_temp$gpa)
    one <- rep(1, dim(mat_temp2)[1])
    h <- exp(beta %*% t(mat_temp2))
    weight <- (h/(1 + h)) * (1 - (h/(1 + h)))
    weight_mat <- Diagonal(length(weight), x = weight)
    deriv <- t(mat_temp2) %*% weight_mat %*% mat_temp2
    return(-1 * as.array(deriv))
}
#Quadratic gain function
#Minimized at Score = 0, so minimizing it is equivalent to solving the
#FOC of the likelihood. This is the GMM approach.
Quad_Gain <- function(beta) {
    h <- Score(as.numeric(beta))
    return(sum(h * h))
}
#Derivative of the quadratic gain function
Quad_Gain_deriv <- function(beta) {
    return(2 * t(Score_Deriv(beta)) %*% Score(beta))
}
sol1=glm(admit ~ gpa, data = data_analysis, family = "binomial")
sol2=optim(c(2,2),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
sol3=optim(c(0,0),Quad_Gain,gr=Quad_Gain_deriv,method="BFGS")
When I run this code, I get that sol3 matches what glm produces (sol1) but sol2, with a different initial point, differs from the glm solution by a lot. This is something happening in my main code with the actual data as well. One solution is to create a grid and test multiple starting points. However, my main data set has 10 parameters and this would make the grid very large and the program computationally infeasible. Is there a way around this problem?
Your code seems overly complicated. The following two functions define the negative log-likelihood and negative score vector for a logistic regression with the logit link:
logLik_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}
score_Bin <- function (betas, y, X) {
eta <- c(X %*% betas)
- crossprod(X, y - plogis(eta))
}
Then you can use it as follows:
# load the data
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# fit with optim()
opt1 <- optim(c(-1, 1, -1), logLik_Bin, score_Bin, method = "BFGS",
y = mydata$admit, X = cbind(1, mydata$gre, mydata$gpa))
opt1$par
# compare with glm()
glm(admit ~ gre + gpa, data = mydata, family = binomial())
Typically, for well-behaved covariates (i.e., expecting coefficients roughly in the interval [-4, 4]), starting at 0 is a good idea.

Estimated inverse function is not continuous

I estimated the CDF of my density on an interval of length 0.03, using 10k points. Even though the CDF is quite smooth, its inverse isn't smooth at all. Here the interval of length 1 is also evaluated with 10k points.
See: Estimated CDF & Estimated Inverse CDF
For the Inverse CDF I use:
x = seq(from = 0, to = 1, length = 10000)
F_hat_inv_given_x = function(y) {
    uniroot(function(x) {
        F_hat_given_x(x) - y
    }, interval = c(0.065, 0.095))$root
}
F_hat_inv_given_x = Vectorize(F_hat_inv_given_x)
with F_hat_given_x defined as:
F_hat_given_x = function(y) {
integrate(f = f_hat_given_x, min(y_data), y)$value
}
F_hat_given_x <- Vectorize(F_hat_given_x)
where f_hat_given_x is my density:
f_hat_given_x = function(y){
tapply(y, x = x_sample, FUN = f_hat, INDEX = 1:length(y))
}
which is a conditional density f(y | X = x) for a given sample X = x_sample:
f_hat = function(x, y){
(sum(K(abs(x-x_data)/H_n) * (K(abs(y-y_data)/h_n)))) / (h_n * sum(K(abs(x-x_data)/H_n)))
}
with K being the Gaussian kernel, h_n and H_n bandwidths computed via npcdensbw, and x_data and y_data the given data on which I evaluate the conditional density f_hat(x, y).
Any idea why the inverse function isn't monotone increasing, as the inverse of a CDF should be? What is causing the inaccuracy?
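One way to sidestep the problem, offered as a sketch rather than a confirmed fix: evaluate F_hat_given_x once on a fine grid, force the grid values to be non-decreasing, and invert by interpolation, which is monotone by construction and avoids per-point uniroot tolerance noise:
# evaluate the estimated CDF on a grid over the support used above
y_grid <- seq(0.065, 0.095, length.out = 10000)
F_grid <- F_hat_given_x(y_grid)
F_grid <- cummax(F_grid)              # enforce monotone non-decreasing values
# invert by linear interpolation: maps probabilities back to y values
F_hat_inv_interp <- approxfun(F_grid, y_grid, ties = "ordered")

u <- seq(0, 1, length.out = 10000)
y_inv <- F_hat_inv_interp(u)          # monotone by construction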

Extract coefficients from penalized regression

I am using the sparse group lasso, a penalized regression, via the SGL package. I tried to run the package's example in R; the code is given below:
set.seed(1)
n = 50; p = 100; size.groups = 10
index <- ceiling(1:p / size.groups)
X = matrix(rnorm(n * p), ncol = p, nrow = n)
beta = (-2:2)
y = X[,1:5] %*% beta + 0.1*rnorm(n)
data = list(x = X, y = y)
cvFit = cvSGL(data, index, type = "linear")
I tried to extract the regression coefficients from cvFit, but coef() returns NULL:
coef(cvFit)
NULL
Can anyone tell me what is wrong? Thanks in advance.
This extracts the coefficients from the model with the minimum lambda value:
coef(cvFit, s = cvFit$lambda.min)
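If coef() keeps returning NULL, the estimates can be pulled out of the fitted object directly. This is a sketch under the assumption (worth checking with str(cvFit)) that cvSGL stores the cross-validated criterion in lldiff and the coefficient paths in fit$beta:
# assumption: cvFit$lldiff is the CV criterion per lambda, and
# cvFit$fit$beta is the p x nlambda matrix of coefficient paths
best <- which.min(cvFit$lldiff)    # index of the best lambda
betas <- cvFit$fit$beta[, best]    # coefficients at that lambda
betas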
