MLE regression that accounts for two constraints - r

I want to fit a logistic regression that simultaneously satisfies two constraints.
The link here outlines how to use the Excel Solver to maximize the log-likelihood of a logistic regression, but I want to implement a similar function in R.
What I am trying to create in the end is an injury risk function. These are S-shaped curves.
As we see, the risk curves are calculated from the following equation
Let's take some dummy data to begin with
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)
df1 <- data.frame(A,B,C)
Let's assume A indicates whether a bone was broken, B is the bone density and C is the force applied.
So we can form a logistic regression model in which B and C explain the outcome variable A. A simple example of the regression may be:
Or
glm(A ~ B + C, data=df1, family=binomial())
Now we want to make the first assumption that at zero force, we should have zero risk. This is further explained as A1. on pg.124 here
Here we set our A1=0.05 and solve the equation
A1 = 1 - (1-P(0))^n
where P(0) is the probability of injury when the injury related parameter is zero and n is the sample size.
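Rearranging the equation above gives P(0) = 1 - (1 - A1)^(1/n). A minimal sketch of that step in R, assuming A1 = 0.05 and the n = 153 rows of the dummy data (the variable names A1, n and P0 are only illustrative):

```r
# Solve A1 = 1 - (1 - P0)^n for P0
A1 <- 0.05   # chosen value for assumption A1
n  <- 153    # sample size of the dummy data

P0 <- 1 - (1 - A1)^(1 / n)
P0                     # about 3.35e-4

# Check: plugging P0 back into the original equation recovers A1
1 - (1 - P0)^n
```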
We have our sample size and can solve for P(0). We get 3.4E-4. Such that:
The second assumption is that we should maximize the log-likelihood function of the regression
We want to maximize the following equation
Where pi is estimated from the above equation and yi is the observed value for non-break for each interval
From what I understand, I have to use one of two functions in R to maximize the log-likelihood: mle from the stats4 package (which ships with R) or mle2 from the bbmle package.
I guess I need to write a function along these lines
# Bernoulli log-likelihood: sequence is the 0/1 outcome, p the predicted probability
log.likelihood.sum <- function(sequence, p) {
  log.likelihood <- sum(log(p) * (sequence == 1)) + sum(log(1 - p) * (sequence == 0))
  log.likelihood  # return the value visibly rather than as an invisible assignment
}
But I am not sure where I should account for the first assumption, i.e. am I best to build it into the above code, and if so, how? Or would it be more efficient to write a second function that combines the two assumptions? Any advice would be great, as I have very limited experience in writing and understanding functions.
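One possible way to combine the two assumptions, offered as a sketch rather than a definitive answer: if the risk curve is a logistic function of force C alone (an assumption made here; bone density B is left out because P(0) would otherwise also depend on B), then requiring P(0) = p0 fixes the intercept at qlogis(p0), and only the slope is left to estimate by maximum likelihood. The names p0, b0, b1, negLL and risk are illustrative.

```r
# Dummy data as in the question (B is drawn to keep the same random stream)
set.seed(112233)
A <- rbinom(153, 1, 0.6)
B <- rnorm(153, mean = 50, sd = 5)
C <- rnorm(153, mean = 100, sd = 15)

# Assumption 1: P(0) = p0 pins the intercept, because plogis(b0) = p0
p0 <- 1 - (1 - 0.05)^(1 / length(A))
b0 <- qlogis(p0)

# Assumption 2: maximize the log-likelihood over the remaining slope b1.
# plogis(..., log.p = TRUE) returns log(p) and log(1 - p) without overflow.
negLL <- function(b1) {
  eta <- b0 + b1 * C
  -sum(A * plogis(eta, log.p = TRUE) + (1 - A) * plogis(-eta, log.p = TRUE))
}

# One free parameter, so a 1-D search over an assumed interval is enough
fit <- optimize(negLL, interval = c(0, 1))
b1_hat <- fit$minimum

# Fitted risk curve; risk(0) equals p0 by construction
risk <- function(force) plogis(b0 + b1_hat * force)
risk(c(0, 50, 100, 150))
```

Because negLL takes the free parameter as a named argument, it should also be usable as the minuslogl argument of stats4::mle or bbmle::mle2 with start = list(b1 = ...). The same constrained fit can be expressed as glm(A ~ 0 + C, offset = rep(b0, length(A)), family = binomial()).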

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable (scores are doubled so they become integers from 0 to 10); the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the size at 10 and the probability so that the mean matches the mean of your scores; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
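For readers unsure what the density being fitted looks like, the beta-binomial probability mass function can be written out in base R. This is a sketch that assumes the prob/theta parametrization corresponds to beta shape parameters a = theta * prob and b = theta * (1 - prob); dbb is a hypothetical helper name, and theta = 100 is only the starting value used above, not the fitted estimate.

```r
# Beta-binomial pmf: choose(size, x) * B(x + a, size - x + b) / B(a, b)
dbb <- function(x, prob, size, theta, log = FALSE) {
  a <- theta * prob          # assumed mapping to beta shape parameters
  b <- theta * (1 - prob)
  out <- lchoose(size, x) + lbeta(x + a, size - x + b) - lbeta(a, b)
  if (log) out else exp(out)
}

# Doubled body condition scores from the question
x1 <- 2 * c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5)

probs <- dbb(0:10, prob = mean(x1) / 10, size = 10, theta = 100)
sum(probs)           # a valid pmf sums to 1
sum(0:10 * probs)    # the mean is size * prob, i.e. mean(x1)
```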
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually OK using a normal approximation, and in this case the estimates (on the doubled scale) are:
normal_fit$estimate
mean sd
6.560000 1.134196
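The question asked for a 95% interval. One reading of that, taken here as an assumption, is the central range that contains 95% of the fitted normal distribution (not a confidence interval for the mean). A base-R sketch that reproduces the estimates above without MASS and converts the range back to the original 0-5 scale:

```r
# Doubled body condition scores from the question
x1 <- 2 * c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5)

m <- mean(x1)
s <- sqrt(mean((x1 - m)^2))   # maximum-likelihood sd (divides by n, not n - 1)

# Central 95% range of the fitted normal, halved to return to the 0-5 scale
range95 <- qnorm(c(0.025, 0.975), mean = m, sd = s) / 2
range95
```

On the original scale this gives roughly 2.17 to 4.39; since scores only come in steps of 0.5, the bounds would need rounding to the nearest valid score.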

Defining exponential distribution in R to estimate probabilities

I have a bunch of random variables (X1,....,Xn) which are i.i.d. Exp(1/2) and represent the duration of a certain event. This distribution has an expected value of 2, but I am having problems defining it in R. I did some research and found something about so-called Monte Carlo simulation, but I can't find what I am looking for in it.
An example of what I want to estimate: let's say we have 10 random variables (X1,..,X10) distributed as above, and we want to determine, for example, the probability P([X1+...+X10<=25]).
Thanks.
You don't actually need monte carlo simulation in this case because:
If Xi ~ Exp(λ) then the sum (X1 + ... + Xk) ~ Erlang(k, λ) which is just a Gamma(k, 1/λ) (in (k, θ) parametrization) or Gamma(k, λ) (in (α,β) parametrization) with an integer shape parameter k.
From wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions)
So, P([X1+...+X10<=25]) can be computed by
pgamma(25, shape=10, rate=0.5)
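As a cross-check on the closed form, the Erlang distribution function can also be written as a Poisson tail: the sum of k exponentials is at most x exactly when a Poisson process with rate λ has at least k events by time x. A small sketch (variable names are illustrative):

```r
k <- 10; lambda <- 0.5; x <- 25

# Gamma / Erlang form used in the answer
p_gamma <- pgamma(x, shape = k, rate = lambda)

# Same probability as P(N >= k) with N ~ Poisson(lambda * x)
p_pois <- ppois(k - 1, lambda * x, lower.tail = FALSE)

all.equal(p_gamma, p_pois)
```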
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer to your Monte Carlo estimation of desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples, putting them into a 1000 * 10 matrix. We take row sums and get a vector of 1000 entries. The proportion of values between 0 and 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! I tried to use replicate with this code, like this: F <- function(n, B=1000) mean(replicate(B,(rexp(10, rate = 0.5)))) but I am unable to get the right result.
replicate here generates a matrix, too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo solution to your example, see the other answer. The exponential distribution is a special case of the gamma distribution, and gamma variables with a common rate are additive.
I am giving you a Monte Carlo method because you named it in your question, and it is applicable beyond your example.
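To judge how close a Monte Carlo estimate is, it helps to report its standard error next to the exact value. A sketch with a larger number of replications (B = 1e5 and the seed are arbitrary choices, not from the answers above):

```r
set.seed(1)
B <- 1e5

# Each column holds one set of 10 exponential draws
sums <- colSums(matrix(rexp(10 * B, rate = 0.5), nrow = 10))

est   <- mean(sums <= 25)
se    <- sqrt(est * (1 - est) / B)            # binomial standard error
exact <- pgamma(25, shape = 10, rate = 0.5)   # closed form from the other answer

c(estimate = est, std.error = se, exact = exact)
```

The estimate should fall within a few standard errors of the exact value; increasing B by a factor of 100 shrinks the standard error by a factor of 10.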

How to perform a Multivariate Polynomial Regression when output has stochastic behavior?

I have an experiment being simulated. This experiment has 3 parameters a, b, c (variables?), but the result, r, cannot be "predicted" as it has a stochastic component. To reduce the effect of the stochastic component I've run the experiment several times (n). In summary, I have n 4-tuples (a, b, c, r) where a, b, c are the same but r varies. Each batch of experiments is run with different values of a, b, c (k batches), so the complete data set has k times n 4-tuples.
I would like to find the best polynomial fit for this data and a way to compare candidate fits, like:
fit1: with
fit2: with
fit3: some 3rd degree polynomial function and corresponding error
fit4: another 3rd degree (simpler) polynomial function and corresponding error
and so on...
This could be done with R or Matlab®. I've searched and found many examples, but none handled the same input values with different outputs.
I considered doing the multivariate polynomial regression n times, adding some small delta to each parameter, but I'd rather find a cleaner solution before that.
Any help would be appreciated.
Thanks in advance,
Jacques
Polynomial regression should be able to handle stochastic simulations just fine. Just simulate r, n times, and perform a multivariate polynomial regression across all points you've simulated (I recommend polyfitn()).
You'll have multiple r values for the same [a,b,c], but a well-fit curve should still be able to estimate the expected value of r at each point.
In polyfitn it will look something like this
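n = 1000;                  % repetitions per parameter set
a = rand(500,1);
b = rand(500,1);
c = rand(500,1);
r = zeros(length(a), n);   % one row per parameter set, one column per repetition
for k = 1:n
    for i = 1:length(a)
        r(i,k) = foo(a(i), b(i), c(i));   % foo is the stochastic simulation
    end
end
my_functions = {'a^2 b^2 c^2 a b c',...};
p = cell(size(my_functions));
for fun_id = 1:length(my_functions)
    % r(:) stacks the columns of r, matching the n stacked copies of [a,b,c]
    p{fun_id} = polyfitn(repmat([a,b,c],[n,1]), r(:), my_functions{fun_id});
end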
It's not hard to iteratively or recursively generate a set of polynomial models from a basis function, but for three variables there might not be a need to. Unless you have a specific reason for fitting higher-order polynomials (planetary physics, particle physics, and so on), you shouldn't have too many functions to fit. It is generally not good practice to use higher-order polynomials to explain data without such a reason: they risk overfitting, cope poorly with sparse data and inter-variable noise, and more accurate non-linear methods are often available.
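Since the question allows R as well, the same idea can be sketched in base R: stack all k * n rows, fit polynomials of increasing total degree with lm, and compare them on the same data. The simulated response below is a hypothetical stand-in for the real experiment, and k, n and the noise level are arbitrary.

```r
set.seed(42)
k <- 50   # batches (distinct a, b, c settings)
n <- 20   # repetitions per batch

design <- data.frame(a = runif(k), b = runif(k), c = runif(k))
dat <- design[rep(seq_len(k), each = n), ]   # repeat each setting n times

# Hypothetical stochastic experiment: quadratic signal plus noise
dat$r <- with(dat, 1 + 2 * a - b^2 + 3 * a * c) + rnorm(k * n, sd = 0.2)

# Full polynomials of total degree 1, 2 and 3 in (a, b, c)
fits <- lapply(1:3, function(d)
  lm(r ~ polym(a, b, c, degree = d, raw = TRUE), data = dat))
names(fits) <- paste0("degree", 1:3)

# One row per candidate fit: size, residual error, and AIC
comparison <- data.frame(
  n_coef = sapply(fits, function(f) length(coef(f))),
  rmse   = sapply(fits, function(f) sqrt(mean(residuals(f)^2))),
  AIC    = sapply(fits, AIC)
)
comparison
```

Repeated inputs are not a problem for least squares; they simply let the fit average over the noise at each setting. Because the models are nested, anova(fits[[1]], fits[[2]], fits[[3]]) gives F-tests between them, and a lower AIC indicates a better trade-off between fit and complexity.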

Multilevel logistic regression guessing parameter

I am working in R with the lme4 package and in Mplus, and have the following situation:
I want to predict variable B (which is dichotomous) from variable A (continuous), controlling for random effects at the level of a) Subjects; b) Tasks.
A -> B (1)
The problem is that when I use the model to predict the values of B from A, probabilities below 0.5 get predicted. In my case that doesn't make sense because, if you guess at random, the probability of a correct answer on B would be 0.5.
I want to know how I can constrain model (1) in R or in Mplus so that it doesn't predict probabilities lower than 0.5 for variable B.
Thank you!
I found a solution to the question thanks to Mr Kenneth Knoblauch. Basically, you need the psyphy package in order to use the mafc.logit function.
For example, the code then looks like this:
mod <- glm(B ~ A, data = df, family = binomial(mafc.logit(.m = 2)))
Here .m = 2 specifies a two-choice task, so the link includes a guessing rate of 1/2 and predicted probabilities stay at or above 0.5.
Cheers!
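To show what a guessing parameter does to the model, here is a hand-written sketch in base R. It is not psyphy's implementation and ignores the random effects: the inverse link is assumed to be 1/m + (1 - 1/m) * plogis(eta), the data are simulated, and inv_link and negLL are illustrative names.

```r
set.seed(7)
A <- rnorm(400)                                    # continuous predictor
p_true <- 0.5 + 0.5 * plogis(-0.5 + 1.5 * A)       # true success probability
B <- rbinom(400, 1, p_true)                        # dichotomous outcome

# Inverse link with guessing rate 1/m: probabilities live in [1/m, 1]
inv_link <- function(eta, m = 2) 1 / m + (1 - 1 / m) * plogis(eta)

# Negative Bernoulli log-likelihood for intercept par[1] and slope par[2]
negLL <- function(par) {
  p <- inv_link(par[1] + par[2] * A)
  -sum(dbinom(B, size = 1, prob = p, log = TRUE))
}

fit  <- optim(c(0, 1), negLL)
pred <- inv_link(fit$par[1] + fit$par[2] * A)
range(pred)   # never below 0.5
```

The lower bound comes from the link itself, so no predicted probability can fall under the guessing rate regardless of the estimated coefficients.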

Weighted censored regression in R?

I am very new to R (mostly program in SQL) but was faced with a problem that SQL couldn't help me with. I'll try to simplify the problem below.
Assume I have a set of data with 100 rows where each row has a different weight associated with it. Out of those 100 rows of data, 5 have an X value that is top-coded at 1000. Also assume that X can be represented by the linear equation X ~ Y + Z + U + 0 (want a positive value so I don't want a Y-intercept).
Now, without taking the weights of each row of data into consideration, the formula I used in R was:
fit = censReg(X ~ Y + Z + U + 0, left = -Inf, right = 1000, data = dataset)
If I computed summary(fit) I would get 0 left-censored values, 95 uncensored values, and 5 right censored values which is exactly what I want, minus the fact that the weights haven't been sufficiently added into the mix. I checked the reference manual on the censReg function and it doesn't seem like it accepts a weight argument.
Is there something I'm missing about the censReg function or is there another function that would be of better use to me? My end goal is to estimate X in the cases where it is censored (i.e. the 5 cases where it is 1000).
You should use Tobit regression for this situation; it is designed specifically to model a latent variable linearly when the observed values are censored, as in the case you describe.
The regression accounts for your weights and the censored observations, which can be seen in the derivation of the log-likelihood function for the Type I Tobit (upper and lower bounded).
Tobit regression can be found in the VGAM package, using the vglm function with the tobit family function. An excellent example can be found here:
http://www.ats.ucla.edu/stat/r/dae/tobit.htm
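To make the role of the weights and the censored rows concrete, the weighted Tobit log-likelihood can be written by hand and maximized with optim. This is a base-R sketch on simulated data, not the VGAM route described above: the weights w, the coefficients and the noise level are invented for illustration, and the weights are treated as multipliers on each row's log-likelihood contribution.

```r
set.seed(1)
n <- 100
Y <- runif(n, 0, 100); Z <- runif(n, 0, 100); U <- runif(n, 0, 100)
w <- runif(n, 0.5, 2)                                        # hypothetical row weights
X <- pmin(4 * Y + 5 * Z + 6 * U + rnorm(n, sd = 50), 1000)   # top-coded at 1000
cens <- X >= 1000

# Weighted Tobit negative log-likelihood (no intercept, right-censored at 1000)
negLL <- function(par) {
  mu    <- par[1] * Y + par[2] * Z + par[3] * U
  sigma <- exp(par[4])                                       # keeps sigma positive
  ll <- ifelse(cens,
               pnorm(1000, mu, sigma, lower.tail = FALSE, log.p = TRUE),
               dnorm(X, mu, sigma, log = TRUE))
  -sum(w * ll)
}

# Start from weighted least squares that ignores the censoring
lm0   <- lm(X ~ Y + Z + U + 0, weights = w)
start <- c(unname(coef(lm0)), log(sd(residuals(lm0))))
fit   <- optim(start, negLL, method = "BFGS")

# Expected latent value for the censored rows: E[X* | X* > 1000]
mu_c  <- as.vector(cbind(Y, Z, U)[cens, , drop = FALSE] %*% fit$par[1:3])
sigma <- exp(fit$par[4])
z     <- (1000 - mu_c) / sigma
X_hat <- mu_c + sigma * dnorm(z) / pnorm(z, lower.tail = FALSE)
X_hat
```

Uncensored rows contribute the normal density, censored rows contribute the probability of exceeding 1000, and each contribution is scaled by its weight. The conditional mean in the last step is one way to estimate X where it was recorded as 1000.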

Resources