Pooled likelihood in R, how to code it?

I don't really know how to construct the following likelihood in R.
I have panel data for several individuals. Each individual is observed over time and I have several observations for each individual. I assume that the observations for each individual are independent and follow a normal distribution with an individual-specific mean and a variance common to all the individuals. In other words, I want to use all the data for estimating the variance, while using individual specific data to estimate the individual means.
Formally, let $i=1,2,\ldots,N$ index the individuals and $j=1,2,\ldots,n_i$ the observations of individual $i$.
The likelihood function for individual $i$ is
$$
L_i(\mu_i,\sigma)=\prod_{j=1}^{n_i}\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(x_{ij}-\mu_i)^2}{2\sigma^2})
$$
and the full likelihood is then
$$
L(\mu_1,...\mu_N,\sigma)=\prod_{i=1}^N \prod_{j=1}^{n_i}\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(x_{ij}-\mu_i)^2}{2\sigma^2})
$$
What's a smart way to code the log of the full likelihood in R in order to pass it to some optimizer?
Example data:
set.seed(2)
A <- rep(1,15)
B <- rep(2,11)
ID <- c(A,B)
dA <- rnorm(15,2,3)
dB <- rnorm(11,1,3)
X <- c(dA,dB)
DATA <- data.frame(ID,X)
DATA
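One way to do it (a minimal sketch, not the only approach): stack all the means and a log-transformed sigma into one parameter vector, let dnorm() do the per-observation work, and hand the negative log-likelihood to optim():
# negative log-likelihood for (mu_1, ..., mu_N, log(sigma));
# sigma enters on the log scale so the optimizer can run unconstrained
negloglik <- function(par, data) {
  mu    <- par[-length(par)]            # individual-specific means
  sigma <- exp(par[length(par)])        # common standard deviation
  idx   <- as.integer(factor(data$ID))  # map each observation to its individual
  -sum(dnorm(data$X, mean = mu[idx], sd = sigma, log = TRUE))
}
start <- c(tapply(DATA$X, DATA$ID, mean), log(sd(DATA$X)))
fit   <- optim(start, negloglik, data = DATA, method = "BFGS")
fit$par   # mu_1, mu_2, log(sigma)
Starting from the per-individual sample means usually puts the optimizer close to the maximum already.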

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions that appear able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to understand and implement the instructions on the function help pages, at least not in a way that has served my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2   # rescale the 0-5 scores (0.5 steps) to integers 0-10
h = hist(x1, breaks=seq(0,10))
bp = barplot(h$counts/length(x1), names.arg=(h$mids+0.5)/2, ylim=c(0,0.35))   # empirical probabilities, labelled on the original scale
You can try to fit it, but you have too few data points to estimate the three parameters needed for a beta-binomial. Hence I fix the probability so that the mean matches the mean of your scores; looking at the distribution above, that seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log=TRUE))   # negative log-likelihood
}
m0 <- mle2(mtmp, start=list(theta=100),
           data=list(size=10, prob=mean(x1)/10), control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually fine using the normal estimation, and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
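If you go with the normal fit, one simple way to get the 95% interval the question ultimately asks for is to take the normal quantiles and rescale back to the original 0-5 scale (recall x1 is the scores times 2):
qnorm(c(0.025, 0.975), MEAN, SD) / 2   # approximate 95% interval on the 0-5 scale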

How to extract coefficients outputs from a linear regression with loop

I would like to know how I can loop a regression n times, each time with a different set of variables, and extract a data.frame where each column is a regression and each row represents a variable.
In my case I have a data.frame of:
dt_deals <- data.frame(Premium=c(1,3,4,5),Liquidity=c(0.2,0.3,1.5,0.8),Leverage=c(1,3,0.5,0.7))
But I have another explanatory dummy variable called hubris, which is drawn from a binomial distribution with size 1 and probability 0.25, like this:
n <- 10
hubris_dataset <- data.frame(replicate(n, rbinom(4, 1, 0.25)))
In this sense, what I need is to run n simulations of hubris so I can fit n regressions, each with a different random binomial draw, and the output of each regression needs to go into a data.frame.
So far I could reach this:
# define n as the number of simulations i want
n=10
# define beta as a data.frame to put every coefficient from the lm regression
beta=NULL
for(i in 1:n) {
  dt_deals2 <- dt_deals
  beta[[i]] <- coef(lm(dt_deals$Premium ~ dt_deals$Liquidity + dt_deals$Leverage +
                       hubris_dataset[,i], data=dt_deals2))
  beta <- cbind(reg$coefficients)   # 'reg' is never defined, and this overwrites beta on every pass
}
But this way it only generates the first set of coefficients and doesn't produce the other ten columns for the data.frame.
@jogo suggested changing the for-loop approach to sapply and making beta a list. This was the result:
beta <- sapply(1:n, function(i) coef(lm(Premium ~ Liquidity +Leverage+ hubris_dataset[,i], data=dt_deals2)))
And it worked
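For completeness (a small sketch building on the sapply line above): beta comes back as a matrix with one row per coefficient and one column per regression, so the requested data.frame is one conversion away:
beta_df <- as.data.frame(beta)            # rows = coefficients, columns = regressions
colnames(beta_df) <- paste0("reg", 1:n)   # label each column by its simulation
beta_df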

MLE regression that accounts for two constraints

I want to fit a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel solver to maximize the log-likelihood of a logistic regression, but I want to implement something similar in R.
What I am trying to create in the end is an injury risk function. These take an S-shaped form: the risk curves are calculated from the logistic equation
$$
P(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}
$$
Lets take some dummy data to begin with
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Lets assume A indicates if a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model in which B and C explain the outcome variable A. A simple example of the regression may be:
glm(A ~ B + C, data=df1, family=binomial())
Now we want to make the first assumption: that at zero force we should have zero risk. This is further explained as A1. on pg.124 here
Here we set $A_1=0.05$ and solve the equation
$$
A_1 = 1-(1-P(0))^n
$$
where $P(0)$ is the probability of injury when the injury-related parameter is zero and $n$ is the sample size. With $n=153$ we can solve for $P(0)$:
$$
P(0) = 1-(1-A_1)^{1/n} = 1-(1-0.05)^{1/153} \approx 3.4\times 10^{-4}
$$
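A quick check of that arithmetic in R:
1 - (1 - 0.05)^(1/153)   # 0.000335..., i.e. about 3.4E-4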
The second assumption is that we should maximize the log-likelihood function of the regression, i.e. maximize
$$
LL = \sum_{i}\left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]
$$
where $p_i$ is estimated from the logistic equation above and $y_i$ is the observed break/non-break outcome for each observation.
From what I understand, I have to use one of two functions in R to define a function for maximizing the LL: mle from the stats4 package, or mle2 from the bbmle package.
I guess I need to write a function along these lines:
log.likelihood.sum <- function(sequence, p) {
  sum(log(p)*(sequence==1)) + sum(log(1-p)*(sequence==0))   # return the log-likelihood
}
But I am not sure where I should account for the first assumption, i.e. am I best to build it into the above code, and if so, how? Or would it be more efficient to write a second function combining the two assumptions? Any advice would be great, as I have very limited experience writing and understanding functions.
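One way to account for the first assumption (a sketch under the assumption that C is the injury-related parameter, not necessarily what the linked paper does): fix the intercept at qlogis(P(0)), so the curve passes through the constrained zero-force risk, and maximize the log-likelihood over the slope alone:
P0 <- 3.4e-4                        # P(0) from assumption A1 above
b0 <- qlogis(P0)                    # fixed intercept so that P(force = 0) = P0
negLL <- function(b1, y, force) {   # negative of the LL defined above
  p <- plogis(b0 + b1 * force)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
fit <- optimize(negLL, interval = c(-5, 5), y = df1$A, force = df1$C)
fit$minimum                         # constrained slope estimate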

Compare model fits of two GAM

I have a matrix Expr with rows representing variables and columns samples.
I have a categorical vector called groups (containing either "A","B", or "C")
I want to test which of the variables in Expr can be explained by the samples' group membership.
My strategy would be to model the problem with a generalized additive model (with a negative binomial distribution) and then use a likelihood ratio test variable-wise to get a p-value for each variable.
I do:
require(VGAM)
m <- vgam(Expr ~ group, family=negbinomial)
m_alternative <- vgam(Expr ~ 1, family=negbinomial)
and then:
lr <- lrtest(m, m_alternative)
The last step is wrong because it tests the overall likelihood ratio of the two models, not the variable-wise ones. Instead of a single p-value I would like to get a vector of p-values, one for every variable.
How should I do it?
(I am very new to R, so forgive me my stupidity)
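A minimal sketch of the variable-wise version of the strategy described above (assuming the rows of Expr are the variables; vglm() suffices here since there are no smooth terms): fit one model per row and compute the LR p-value by hand:
require(VGAM)
pvals <- apply(Expr, 1, function(y) {
  full <- vglm(y ~ group, family = negbinomial)
  null <- vglm(y ~ 1, family = negbinomial)
  stat <- 2 * (logLik(full) - logLik(null))    # likelihood-ratio statistic
  df <- df.residual(null) - df.residual(full)
  pchisq(stat, df = df, lower.tail = FALSE)    # one p-value per variable
})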
It sounds like you want to use Expr as your predictors. I think you may have your formula backwards: the response should be on the left, so I guess that's groups in your case.
If Expr is a data.frame, you can do regression on all variables with
m <- vgam(group ~ ., Expr, family=negbinomial)
If class(Expr)=="matrix", then
m <- vgam(group ~ Expr, family=negbinomial)
probably should work, but you may just get slightly odd looking coefficient labels.

extract residuals from aov()

I've run an anova using the following code:
aov2 <- aov(amt.eaten ~ salt + Error(bird / salt),data)
If I use View(aov2) I can see the residuals within the structure of aov2, but I would like to extract them in a way that doesn't involve cutting and pasting. Can someone help me out with the syntax?
The various versions of residuals(aov2) I have tried only produce NULL.
I just learned that you can use proj():
x1 <- gl(8, 4)                     # 8-level factor, 4 replicates each
block <- gl(2, 16)                 # blocking factor
y <- as.numeric(x1) + rnorm(length(x1))
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)                    # projections onto the error strata
m.pr[['Within']][, 'Residuals']    # residuals live in the Within stratum
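Applied to the model in the question, the same pattern would be (assuming the bird/salt error structure leaves the residuals in the Within stratum):
proj(aov2)[['Within']][, 'Residuals']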
The reason that you cannot extract residuals from this model is that you have specified an error stratum for salt nested within bird (the bird/salt term expands to bird + bird:salt). Here, each unique combination of bird and salt is treated like a random cluster having a unique intercept value but a common additive effect associated with a unit difference in salt and the amount eaten.
I can't conceive of why we would want to specify this as a random effect in this model. But in order to analyze residuals sensibly, you may want to calculate fitted differences in each stratum according to the fitted model and optimal intercept. I think this is tedious work and not very informative, however.
