Why does my MLE for a mixture model not match flexmix? - r

I want to write an MLE for a finite mixture model in R, but the coefficients estimated by my model are not the same as those estimated by the flexmix package. Could you point out my mistakes?
My code is as follows:
#prepare data
slope1 <- -.3;slope2 <- .3;slope3 <- 1.8; slope4 <- 0.5;intercept1 <- 1.5
age <- sample(seq(18,60,len=401), 200)
grade <- sample(seq(0,100,len=401), 200)
not_smsa <- sample(seq(-2,2,len=401), 200)
unemployment <- rnorm(200,mean=0,sd=1)
wage <- intercept1 + slope1*age +slope2*grade + slope3*not_smsa + rnorm(length(age),0,.15)
y <- wage
X <- cbind(1, age , grade , not_smsa)
mydata <- cbind.data.frame(X,y)
anso <- lm(wage ~ age + grade + not_smsa,
data = mydata)
vi <- c(coef(anso),0.01,0.02,0.03,0.04,0.1)
#function
fmm <- function(beta) {
mu1 <- c(X %*% beta[1:4])
mu2 <- c(X %*% beta[5:8])
p1 <- 1 / (1 + exp(-beta[9]))
p2 <- 1-p1
llk <- p1*dnorm(y,mu1)+p2*dnorm(y,mu2)
-sum(log(llk),na.rm=T)
}
fit <- optim(vi,fmm , method = "BFGS", control = list(maxit=50000), hessian = TRUE)
fit$par
library(flexmix)
flexfit <- flexmix(wage ~ age + grade + not_smsa, data = mydata, k = 2)
flexfit$par
c1 <- parameters(flexfit,component=1)
c2 <- parameters(flexfit, component=2)
Are there any mistakes in my code?

I have fixed the mistakes in my code: the parameters of the likelihood function need constraints, i.e. positive standard deviations (here via a softplus transform) and a mixing proportion between 0 and 1 (via the logistic function). The parameter vector then has 11 elements: four coefficients and one standard deviation per component, plus the logit of the mixing proportion.
fmm <- function(pars) {
beta1 <- pars[1:4]
sigma1 <- log(1 + exp(pars[5]))
beta2 <- pars[6:9]
sigma2 <- log(1 + exp(pars[10]))
p1 <- 1 / (1 + exp(-pars[11]))
p2 <- 1 - p1
mu1 <- c(X %*% beta1)
mu2 <- c(X %*% beta2)
llk <- p1*dnorm(y, mu1, sigma1) + p2*dnorm(y, mu2, sigma2)
-sum(log(llk), na.rm = TRUE)
}
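For completeness, a minimal sketch of how the corrected function could be optimised; the starting values below are illustrative assumptions (reusing the lm fit from above for both components), not taken from the original post:
vi2 <- c(coef(anso), log(exp(0.15) - 1), # component 1: coefficients plus unconstrained sigma1
coef(anso) + 0.1, log(exp(0.15) - 1), # component 2: perturbed coefficients plus unconstrained sigma2
0) # logit of the mixing proportion p1
fit2 <- optim(vi2, fmm, method = "BFGS", control = list(maxit = 50000), hessian = TRUE)
# map the unconstrained parameters back to the constrained scale
sigma1_hat <- log(1 + exp(fit2$par[5]))
sigma2_hat <- log(1 + exp(fit2$par[10]))
p1_hat <- 1 / (1 + exp(-fit2$par[11]))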

Related

How to estimate the Kalman Filter with 'KFAS' R package, with an AR(1) transition equation and covariates?

I am using the 'KFAS' package in R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = b_0 + b_1 * xx_t + Z_t * x_t + eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * eta_t (transition),
with eps_t ~ N(0, H_t) and eta_t ~ N(0, Q_t),
where xx_t are covariates. I have read this question and written the following code:
library(KFAS)
set.seed(100)
xx <- rnorm(200)
beta0 <- 0.1
beta1 <- 0.1
eps <- rt(200, 4, 1)
y <- as.matrix(beta0 + beta1*xx + (arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(y ~ xx + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
updatefn <- function(pars, model) {
model$H[1] <- pars[1]
model$T[1] <- pars[2]
model$Q[1] <- pars[3]
model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
I get the error
Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)), :
System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07
I have tried changing the initial vector to c(1, 0.5, 1, 1, 1), but it returns the same message. Does anyone know how I can do this?
Thanks!

Add a Passing-Bablok regression line

I have to perform many comparisons between different measurement methods, and I have to use the Passing-Bablok regression approach.
I would like to take advantage of ggplot2 and faceting, but I don't know how to add a geom_smooth layer based on the Passing-Bablok regression.
I was thinking about something like: https://stackoverflow.com/a/59173260/2096356
Furthermore, I would also need to show the regression line equation, with confidence interval for intercept and slope parameters, in each plot.
Edit with partial solution
I've found a partial solution combining the code provided in this post and in this answer.
## Regression algorithm
passing_bablok.fit <- function(x, y) {
x_name <- deparse(substitute(x))
lx <- length(x)
l <- lx*(lx - 1)/2
k <- 0
S <- rep(NA, lx)
for (i in 1:(lx - 1)) {
for (j in (i + 1):lx) {
k <- k + 1
S[k] <- (y[i] - y[j])/(x[i] - x[j])
}
}
S.sort <- sort(S)
N <- length(S.sort)
neg <- length(subset(S.sort,S.sort < 0))
K <- floor(neg/2)
if (N %% 2 == 1) {
b <- S.sort[(N+1)/2+K]
} else {
b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
}
a <- median(y - b * x)
res <- as.vector(c(a,b))
names(res) <- c("(Intercept)", x_name)
class(res) <- "Passing_Bablok"
res
}
## Computing confidence intervals
passing_bablok <- function(formula, data, R = 100, weights = NULL){
ret <- boot::boot(
data = model.frame(formula, data),
statistic = function(data, ind) {
data <- data[ind, ]
args <- rlang::parse_exprs(colnames(data))
names(args) <- c("y", "x")
rlang::eval_tidy(rlang::expr(passing_bablok.fit(!!!args)), data, env = rlang::current_env())
},
R=R
)
class(ret) <- c("Passing_Bablok", class(ret))
ret
}
## Plotting confidence bands
predictdf.Passing_Bablok <- function(model, xseq, se, level) {
pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
if(se) {
preds <- tcrossprod(model$t, cbind(1, xseq))
data.frame(
x = xseq,
y = pred,
ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
)
} else {
return(data.frame(x = xseq, y = pred))
}
}
An example of usage:
z <- data.frame(x = rnorm(100, mean = 100, sd = 5),
y = rnorm(100, mean = 110, sd = 8))
ggplot(z, aes(x, y)) +
geom_point() +
geom_smooth(method = passing_bablok) +
geom_abline(slope = 1, intercept = 0)
So far, I haven't been able to show the regression line equation, with confidence interval for intercept and slope parameters (as +- or in parentheses).
You've arguably done the difficult part with the PaBa regression.
Here's a basic solution using your passing_bablok.fit function:
z <- data.frame(x = 101:200+rnorm(100,sd=10),
y = 101:200+rnorm(100,sd=8))
mycoefs <- as.numeric(passing_bablok.fit(x = z$x, y=z$y))
paba_eqn <- function(thecoefs) {
l <- list(m = format(thecoefs[2], digits = 2),
b = format(abs(thecoefs[1]), digits = 2))
if(thecoefs[1] >= 0){
eq <- substitute(italic(y) == m %.% italic(x) + b,l)
} else {
eq <- substitute(italic(y) == m %.% italic(x) - b,l)
}
as.character(as.expression(eq))
}
library(ggplot2)
ggplot(z, aes(x, y)) +
geom_point() +
geom_smooth(method = passing_bablok) +
geom_abline(slope = 1, intercept = 0) +
annotate("text",x = 110, y = 220, label = paba_eqn(mycoefs), parse = TRUE)
Note that the equation will vary because of rnorm in the data creation.
The solution could definitely be made more slick and robust, but it works for both positive and negative intercepts.
Equation concept sourced from: https://stackoverflow.com/a/13451587/2651663
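If you also need to display bootstrap confidence intervals for the intercept and slope (e.g. in parentheses next to the point estimates), a minimal sketch along the following lines could work. This is my own addition, reusing the passing_bablok() function and the data frame z defined above; the 95% limits are plain percentile bootstrap intervals:
pb_fit <- passing_bablok(y ~ x, data = z, R = 500)
est <- pb_fit$t0 # point estimates: intercept, slope
ci <- apply(pb_fit$t, 2, quantile, probs = c(0.025, 0.975)) # percentile bootstrap CIs
eqn_label <- sprintf("y = %.2f (%.2f; %.2f) + %.2f (%.2f; %.2f) * x",
est[1], ci[1, 1], ci[2, 1],
est[2], ci[1, 2], ci[2, 2])
ggplot(z, aes(x, y)) +
geom_point() +
geom_smooth(method = passing_bablok) +
annotate("text", x = min(z$x), y = max(z$y), hjust = 0, label = eqn_label)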

Calculating RSS manually with given pairs of beta0 and beta1

I am trying to manually calculate the RSS for a dataset with given pairs of beta0 and beta1. For each (beta0, beta1) pair of values, I need to calculate the residual sum of squares and store it as a vector in data called RSS. Here's the code provided:
x = pinotnoir$Aroma
y = pinotnoir$Quality
fit = lm(y ~ x)
summary(fit)
b0s <- seq(0, 10, .1)
b1s <- seq(0, 4, .01)
data <- expand.grid(beta0=b0s, beta1=b1s)
Here's what I have so far. I think the residual calculation is wrong but I'm not sure how to fix it.
rows = length(b1s)
rsd <- rep(NA,rows)
for (i in 1:rows){
residual = (y - (b0s[i] + b1s[i] * x))^2
rsd[i] <- residual
}
data <- expand.grid(beta0=b0s, beta1=b1s, RSS=rsd)
Any help would be appreciated. Thanks in advance!
I am not sure this is exactly what you are aiming for, but by adapting your code slightly you can get the sum of squared residuals and find which betas minimize them (using the mtcars data for the example).
mtcars
x = mtcars$drat
y = mtcars$wt
(fit = lm(y ~ x))
summary(fit)
grid_len <- 20
b0s <- seq(5, 10, length.out = grid_len)
b1s <- seq(-3, -1, length.out = grid_len)
(data <- expand.grid(beta0=b0s, beta1=b1s))
rows = nrow(data)
resids <- rep(NA,rows)
for (i in 1:rows) {
fitted <- (data$beta0[i] + (data$beta1[i] * x))
squared_resid <- (y - fitted)^2
SSR <- sum(squared_resid)
resids[i] <- SSR
cat(i, ": ", SSR, "\n")
}
data[which.min(resids), ]
fit
results:
> data[which.min(resids), ]
beta0 beta1
332 7.894737 -1.315789
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.906 -1.304
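As an aside, the explicit loop can be replaced by a vectorized pass over the grid. This is my own sketch (not part of the original answer), reusing x, y and data from the code above:
rss_fun <- function(b0, b1) sum((y - (b0 + b1 * x))^2) # RSS for one candidate pair
data$RSS <- mapply(rss_fun, data$beta0, data$beta1) # evaluate it over the whole grid
data[which.min(data$RSS), ] # grid point closest to the lm fit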

Why do MASS::lm.ridge coefficients differ from those calculated manually?

When performing ridge regression manually, as it is defined
solve(t(X) %*% X + lbd*I) %*% t(X) %*% y
I get different results from those calculated by MASS::lm.ridge. Why? For ordinary linear regression the manual method (computing the pseudoinverse) works fine.
Here is my Minimal, Reproducible Example:
library(tidyverse)
ridgeRegression = function(X, y, lbd) {
Rinv = solve(t(X) %*% X + lbd*diag(ncol(X)))
t(Rinv %*% t(X) %*% y)
}
# generate some data:
set.seed(0)
tb1 = tibble(
x0 = 1,
x1 = seq(-1, 1, by=.01),
x2 = x1 + rnorm(length(x1), 0, .1),
y = x1 + x2 + rnorm(length(x1), 0, .5)
)
X = as.matrix(tb1 %>% select(x0, x1, x2))
# sanity check: force ordinary linear regression
# and compare it with the built-in linear regression:
ridgeRegression(X, tb1$y, 0) - coef(summary(lm(y ~ x1 + x2, data=tb1)))[, 1]
# looks the same: -2.94903e-17 1.487699e-14 -2.176037e-14
# compare manual ridge regression to MASS ridge regression:
ridgeRegression(X, tb1$y, 10) - coef(MASS::lm.ridge(y ~ x0 + x1 + x2 - 1, data=tb1, lambda = 10))
# noticeably different: -0.0001407148 0.003689412 -0.08905392
MASS::lm.ridge scales the data before modelling - this accounts for the difference in the coefficients.
You can confirm this by checking the function code by typing MASS::lm.ridge into the R console.
Here is the lm.ridge function with the scaling portion commented out:
X = as.matrix(tb1 %>% select(x0, x1, x2))
n <- nrow(X); p <- ncol(X)
#Xscale <- drop(rep(1/n, n) %*% X^2)^0.5
#X <- X/rep(Xscale, rep(n, p))
Xs <- svd(X)
rhs <- t(Xs$u) %*% tb1$y
d <- Xs$d
lscoef <- Xs$v %*% (rhs/d)
lsfit <- X %*% lscoef
resid <- tb1$y - lsfit
s2 <- sum(resid^2)/(n - p)
HKB <- (p-2)*s2/sum(lscoef^2)
LW <- (p-2)*s2*n/sum(lsfit^2)
k <- 1
dx <- length(d)
div <- d^2 + rep(10, rep(dx,k))
a <- drop(d*rhs)/div
dim(a) <- c(dx, k)
coef <- Xs$v %*% a
coef
# x0 x1 x2
#[1,] 0.01384984 0.8667353 0.9452382
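To confirm the scaling explanation numerically, here is a minimal sketch of my own (assuming ridgeRegression, X and tb1 as defined above) that applies the same column scaling before the manual fit and then transforms the coefficients back:
n <- nrow(X); p <- ncol(X)
Xscale <- drop(rep(1/n, n) %*% X^2)^0.5 # per-column root mean square, as in lm.ridge
Xsc <- X / rep(Xscale, rep(n, p)) # scale the columns
beta_scaled <- ridgeRegression(Xsc, tb1$y, 10) # manual ridge fit on the scaled design
drop(beta_scaled) / Xscale # back-transform to the original scale
coef(MASS::lm.ridge(y ~ x0 + x1 + x2 - 1, data = tb1, lambda = 10))
# the two results above should now agree up to numerical precision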

Extract prediction band from lme fit

I have the following model:
x <- rep(seq(0, 100, by=1), 10)
y <- 15 + 2*rnorm(1010, 10, 4)*x + rnorm(1010, 20, 100)
id <- NULL
for(i in 1:10){ id <- c(id, rep(i,101)) }
dtfr <- data.frame(x=x,y=y, id=id)
library(nlme)
with(dtfr, summary( lme(y~x, random=~1+x|id, na.action=na.omit)))
model.mx <- with(dtfr, (lme(y~x, random=~1+x|id, na.action=na.omit)))
pd <- predict( model.mx, newdata=data.frame(x=0:100), level=0)
with(dtfr, plot(x, y))
lines(0:100, predict(model.mx, newdata=data.frame(x=0:100), level=0), col="darkred", lwd=7)
With predict and level=0 I can plot the mean population response. How can I extract and plot the 95% confidence intervals / prediction bands from the nlme object for the whole population?
Warning: Read this thread on r-sig-mixed models before doing this. Be very careful when you interpret the resulting prediction band.
From the r-sig-mixed-models FAQ, adjusted to your example:
set.seed(42)
x <- rep(0:100,10)
y <- 15 + 2*rnorm(1010,10,4)*x + rnorm(1010,20,100)
id<-rep(1:10,each=101)
dtfr <- data.frame(x=x ,y=y, id=id)
library(nlme)
model.mx <- lme(y~x,random=~1+x|id,data=dtfr)
#create data.frame with new values for predictors
#more than one predictor is possible
new.dat <- data.frame(x=0:100)
#predict response
new.dat$pred <- predict(model.mx, newdata=new.dat,level=0)
#create design matrix
Designmat <- model.matrix(eval(eval(model.mx$call$fixed)[-2]), new.dat[-ncol(new.dat)])
#compute standard error for predictions
predvar <- diag(Designmat %*% model.mx$varFix %*% t(Designmat))
new.dat$SE <- sqrt(predvar)
new.dat$SE2 <- sqrt(predvar+model.mx$sigma^2)
library(ggplot2)
p1 <- ggplot(new.dat,aes(x=x,y=pred)) +
geom_line() +
geom_ribbon(aes(ymin=pred-2*SE2,ymax=pred+2*SE2),alpha=0.2,fill="red") +
geom_ribbon(aes(ymin=pred-2*SE,ymax=pred+2*SE),alpha=0.2,fill="blue") +
geom_point(data=dtfr,aes(x=x,y=y)) +
scale_y_continuous("y")
p1
Sorry for coming back to such an old topic, but this might address a comment here:
it would be nice if some package could provide this functionality
This functionality is included in the ggeffects package when you use type = "re" (which then includes the random-effect variances, not only the residual variances; in this particular example, however, the two amount to the same thing).
library(nlme)
library(ggeffects)
x <- rep(seq(0, 100, by = 1), 10)
y <- 15 + 2 * rnorm(1010, 10, 4) * x + rnorm(1010, 20, 100)
id <- NULL
for (i in 1:10) {
id <- c(id, rep(i, 101))
}
dtfr <- data.frame(x = x, y = y, id = id)
m <- lme(y ~ x,
random = ~ 1 + x | id,
data = dtfr,
na.action = na.omit)
ggpredict(m, "x") %>% plot(rawdata = T, dot.alpha = 0.2)
ggpredict(m, "x", type = "re") %>% plot(rawdata = T, dot.alpha = 0.2)
Created on 2019-06-18 by the reprex package (v0.3.0)
