KeyError: 0 when calculating likelihood in EM algorithm - jupyter-notebook

I have an EM algorithm model. The model should select random columns and then calculate the likelihood. For example, if the dataset has 39 attributes, the model selects only 19 of them and reports the likelihood.
I define several functions:
initialize_model_params, calculate_expected_hidden_vars, update_model_params, calc_likelihood
The problem is that when calculating the likelihood I get the error:
KeyError: 0
It seems to be related to the index.
The calc_likelihood function:
import numpy as np
from scipy.stats import multivariate_normal

def calc_likelihood(data, means, covariances, weights):
    n_samples, n_features = data.shape
    n_components = len(means)
    # Initialize likelihood to 0
    likelihood = 0
    # Loop over each sample in the data
    for i in range(n_samples):
        sample_likelihood = 0
        # Loop over each component in the mixture
        for j in range(n_components):
            # Calculate the probability of the sample given the component.
            # NOTE: if data is a pandas DataFrame, data[i] looks up a *column*
            # labelled i and raises KeyError: 0; data.iloc[i] (or converting
            # the DataFrame with data.to_numpy() beforehand) selects the i-th row.
            component_prob = weights[j] * multivariate_normal.pdf(data[i], mean=means[j], cov=covariances[j])
            # Add the component probability to the sample likelihood
            sample_likelihood += component_prob
        # Add the sample likelihood to the overall likelihood
        likelihood += np.log(sample_likelihood)
    return likelihood
The em function:
import random

def em(data, columns, num_iterations):
    # Initialize the model parameters
    model_params = initialize_model_params(data, columns)
    # Select a random subset of columns
    selected_columns = random.sample(list(columns), k=len(columns) // 2)
    print(selected_columns)
    for i in range(num_iterations):
        # E step: calculate the expected value of the hidden variables using the selected columns
        expected_hidden_vars = calculate_expected_hidden_vars(data, selected_columns, model_params)
        # M step: update the estimates of the model parameters using the selected columns and the expected hidden variables
        model_params = update_model_params(data, selected_columns, expected_hidden_vars)
        means, covariances, weights = model_params
        print('params', model_params)
        # Calculate the likelihood of the data given the current model parameters
        likelihood = calc_likelihood(data, means, covariances, weights)
        print('likelihood : ', likelihood)
    # Return the maximum likelihood estimates of the model parameters
    return model_params
How can I fix it? Can anyone help me?

Related

Test for Poisson residuals in the analysis of variance model

I am trying to find a way to test Poisson residuals, the way one tests normality of aov() residuals. In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now, I'd like to take the same approach, but for a Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
Is there any statistical approach for this?
Thanks!
You could analyse your data in a counting context. Discrete data, such as variables of a Poisson nature, can be analysed based on observed frequencies, and you can formulate a hypothesis test for this task. With your data y, you can contrast the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from a Poisson distribution. Let's sketch the test with your data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are essential for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we must separate the observed values O from their groups or classes. Both elements constitute the y variable:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must compute the lambda parameter. This can be obtained by Maximum Likelihood Estimation (MLE). The MLE for the Poisson parameter is the mean (computed here from the counts and their classes), so we compute it with the following code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
The Poisson distribution has support on arbitrarily large values, so we must also compute the probability of values beyond our last class in order to have probabilities that sum to one:
prhs <- 1-sum(prob)
This probability can simply be added to the last class so that it accounts for all values greater than or equal to it (for example, instead of only the probability that y equals 20, we use the probability that y is greater than or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can conduct a goodness-of-fit test using the chisq.test() function in R. It requires the observed values O and the probabilities prob that we have computed. Keep in mind that this test reports the wrong degrees of freedom for our setup, so we correct it using the formulation with k - q - 1 degrees of freedom, where k is the number of classes and q is the number of estimated parameters (we estimated one, lambda, by MLE). Now the test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is X-squared, the test statistic. We can reuse it to obtain the corrected p-value (in our example k = 18 and q = 1, so the degrees of freedom are 18 - 1 - 1 = 16).
The p-value can be obtained with the following code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this p-value is far above common significance levels, we do not reject the null hypothesis, and we can conclude that y is consistent with a Poisson distribution.
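For convenience, the whole procedure can be wrapped in a small helper. This is only a sketch of the steps described above (lambda estimated by MLE, the right tail folded into the last class, degrees of freedom corrected to k - 1 - 1); the function name is mine.
# Sketch: chi-squared goodness-of-fit test for a Poisson distribution
poisson_gof <- function(y) {
  # Observed frequencies and their classes
  df <- as.data.frame(table(y), stringsAsFactors = FALSE)
  classes <- as.integer(df$y)
  O <- df$Freq
  # MLE of lambda from the grouped counts
  lambda_hat <- sum(O * classes) / sum(O)
  # Class probabilities, with the right tail folded into the last class
  prob <- dpois(classes, lambda_hat)
  prob[length(prob)] <- prob[length(prob)] + (1 - sum(prob))
  # Chi-squared statistic, then p-value with corrected degrees of freedom
  stat <- unname(chisq.test(O, p = prob)$statistic)
  k <- length(O)
  list(statistic = stat, df = k - 2, p.value = 1 - pchisq(stat, k - 2))
}
poisson_gof(y2)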

Extract AIC from FitARMA in R

I apply the FitARMA function from the FitARMA package to a certain time series and get the following result:
> model <- FitARMA(ts, c(1,0,1))
> model
ARIMA(1,0,1)
length of series = 1593 , number of parameters = 3
loglikelihood = 5113 , aic = -10220 , bic = -10203.9
I want to extract aic into a variable. However, there is no aic component in the model details, nor any information about it in the package documentation.
Is it possible to do something like model_aic <- model$aic? I want to run a for loop over different p, q orders of the ARMA model, so I would like to extract the aic into a variable instead of typing it from the console manually.
One way is to create a function that computes the AIC of the FitARMA model
library(FitARMA)
model <- FitARMA(AirPassengers, c(1,0,1))
model
ARIMA(1,0,1)
length of series = 144 , number of parameters = 3
loglikelihood = -496.55 , aic = 999.1 , bic = 1008
AICFitARMA <- function(model){
  k <- nrow(coef.FitARMA(model))
  AIC <- 2 * k - 2 * model$loglikelihood
  return(AIC)
}
AICFitARMA(model)
[1] 999.0944
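Since the question mentions looping over different p, q orders, a rough sketch of that loop using the AICFitARMA helper above might look like the following (the order grid and the AirPassengers series are placeholders for your own choices, and it assumes coef.FitARMA returns one row per estimated coefficient, as in the answer above):
# Try a small grid of ARMA(p, q) orders and record the AIC of each fit
orders <- expand.grid(p = 1:3, q = 1:3)
orders$aic <- mapply(function(p, q) AICFitARMA(FitARMA(AirPassengers, c(p, 0, q))),
                     orders$p, orders$q)
# Order with the smallest AIC
orders[which.min(orders$aic), ]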

Using a loop to calculate BIC with high dimensional data in R studio? (my code keeps giving me errors)

I am working with a large high-dimensional data set (so p > n). I am attempting to use BIC for model selection. Here is what I am doing in RStudio:
X is my predictor matrix and Y is my outcome vector.
fit <- glmnet(X,Y,alpha=1) # fit LASSO, find 100 lambdas
models <- list()
for(i in 1:100) {
  models[[i]] = fit
}
BIC(models)
This results in an error which states "Error in UseMethod("logLik") : no applicable method for 'logLik' applied to an object of class "list"".
I also attempt to compute BIC while in the loop as follows:
for (i in 1:100){
  BIC(models[i])
}
Which gives me the same error.
You can check out this answer on how to calculate BIC; we first create a function that calculates it based on the log-likelihood:
minustwologLik = function(fit){
  n = length(fit$residuals)
  deviance = sum(fit$residuals^2)
  n*(log(deviance/n) + 1 + log(2*pi))
}

BIC_manual = function(fit,n,k){
  log(n)*k + minustwologLik(fit)
}
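For reference, under a Gaussian error model with the variance estimated by maximum likelihood, these two functions implement
$$-2\log\hat{L} = n\left(\log\frac{\mathrm{RSS}}{n} + 1 + \log 2\pi\right), \qquad \mathrm{BIC} = k\log n - 2\log\hat{L},$$
where RSS is the residual sum of squares, n the number of observations, and k the number of estimated parameters.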
We can test it for normal glms:
set.seed(100)
X = matrix(rnorm(5000),ncol=50)
Y = rnorm(100)
lmf = glm(Y~X)
# k is number of predictors + 2 because
# we have an intercept and we estimate the error
BIC_manual(lmf,nrow(X),ncol(X)+2)
[1] 446.8827
BIC(lmf)
[1] 446.8827
Now we have a working BIC function. Suppose we want to test 10 values of lambda
library(glmnet)
lambda = runif(10,min=0,max=0.1)
models <- list()
for(i in 1:10) {
  fit = glmnet(X,Y,alpha=1,lambda=lambda[i])
  # we store the residuals
  fit$residuals = predict(fit,X)-Y
  models[[i]] = fit
}
To calculate BIC, we do:
results = sapply(1:length(models), function(i){
  BIC_manual(models[[i]], nrow(X), models[[i]]$df + 4)
})
Here I think k is the number of nonzero coefficients + 4, because you have the intercept, alpha, beta and the error. It does not matter much when you compare between models, since this constant part is the same across models; what matters is getting the number of nonzero coefficients.
For the MSE, it is the deviance we were calculating. With glmnet you can do, for example:
deviance.glmnet(models[[1]])
We collect the MSE in a similar way:
MSE=sapply(models,deviance.glmnet)
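Putting the penalty values, the BIC values and the MSE together in one data frame (just one way to build the summary shown next) could look like:
# Combine lambda, BIC and MSE side by side
data.frame(lambda = lambda, BIC = results, MSE = MSE)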
So the results:
lambda BIC MSE
1 0.003344587 447.2294 46.75395
2 0.028688622 408.8085 55.32970
3 0.056700061 370.3395 65.44696
4 0.078313596 362.2118 72.54230
5 0.090647978 359.2993 77.25786
6 0.062240390 359.1432 67.18382
7 0.077180937 361.6381 72.12735
8 0.044374976 382.2904 61.34681
9 0.088453015 358.1662 76.38739
10 0.069375991 357.8248 69.42866

Why calculating MSE in lasso regression gives different outputs?

I am trying to run different regression models on the Prostate cancer data from the lasso2 package. When I use the lasso, I have seen two different methods to calculate the mean squared error, but they give me quite different results, so I would like to know whether I'm doing something wrong or whether one method is simply better than the other.
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min
# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest)
# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
So these are the outputs I got for both MSE:
MSE (method 1): 0.4609978
MSE (method 2): 0.5654089
And they're quite different. Does anyone know why?
Thanks a lot in advance for your help!
Samuel
As pointed out by @alistaire, in the first case you are using the test data to compute the MSE, while in the second case the MSE from the cross-validation (training) folds is reported, so it's not an apples-to-apples comparison.
We can do something like the following for an apples-to-apples comparison (by keeping the fitted values on the training folds), and as we can see, mse.1 and mse.2 are exactly equal when computed on the same training folds (although the value differs a little from yours; this was run on my desktop with R version 3.1.2, x86_64-w64-mingw32, Windows 10):
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
# keep the fitted values on the training folds
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)
# get the predicted values on the training folds with lambda.min (not from test data)
mse.1 = mse(pr.lasso$fit[,lambda.id], ytrain)
cat("MSE (method 1): ", mse.1, "\n")
MSE (method 1): 0.6044496
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
MSE (method 2): 0.6044496
mse.1 == mse.2
[1] TRUE

Choose model by BIC in a stepwise algorithm after choosing model from glmnet

I have data where the number of observations n is smaller than the number of variables p. The response variable is binary. For example:
n <- 10
p <- 100
x <- matrix(rnorm(n*p), ncol = p)
y <- rbinom(n, size = 1, prob = 0.5)
I would like to fit a logistic model to this data, so I used the code:
model <- glmnet(x, y, family = "binomial", intercept = FALSE)
The function returns 100 models for different $\lambda$ values (the penalization parameter in LASSO regression). I would like to choose the largest model which has n - 1 parameters or fewer (so fewer than the number of observations). Let's say the chosen model corresponds to lambda_opt.
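For illustration, lambda_opt could, for example, be taken as the smallest lambda whose model still has at most n - 1 nonzero coefficients:
# one possible, illustrative way to define lambda_opt from the fitted path
lambda_opt <- min(model$lambda[model$df <= n - 1])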
model_one <- glmnet(x, y, family = "binomial", intercept = FALSE, lambda = lambda_opt)
Now I would like to do the second step: use the step function on my model to choose the submodel that is best in terms of BIC (Bayesian Information Criterion). Unfortunately, the step function doesn't work for objects of the glmnet class.
step(model_one, direction = "backward", k = log(n))
How can I perform such a procedure? Is there any other function for this class (glmnet) that does what I want?
BIC is a fine way to select a penalty parameter from the sequence returned by glmnet; it's faster than cross-validation and works quite well, at least in the settings where I've tried it.
Compute the residual sum of squares for each value of the penalty parameter in the sequence (use predict(model, x) to get the fit).
model$df gives you the degrees of freedom.
Combine those to get a BIC and pick the value of lambda corresponding to the lowest BIC, as in the sketch below.
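A minimal sketch of this procedure for the binomial fit in the question; note that it substitutes the model deviance (which glmnet reports for every lambda via deviance()) for the residual sum of squares mentioned above, a choice on my part rather than part of the steps as stated:
library(glmnet)
# x, y and n are the objects defined in the question
model <- glmnet(x, y, family = "binomial", intercept = FALSE)
# deviance at every lambda, penalized by log(n) times the number of nonzero coefficients
bic <- deviance(model) + log(n) * model$df
lambda_opt <- model$lambda[which.min(bic)]
model_one <- glmnet(x, y, family = "binomial", intercept = FALSE, lambda = lambda_opt)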
