Success/failure error estimation in R - r

I have success/failure data (trees that survived/died over a certain period) and would like to estimate, for each of my 7 observations (sites), an error based on a binomial distribution. So far I have been using glm to do so:
s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure
# for each observation I would calculate this error:
error <- vector()
z_scores <- vector()
p_value <- vector()
for (i in 1:7) {
  models <- glm(cbind(s[i], f[i]) ~ 1, family = "binomial")
  error[i] <- summary(models)$coefficients[2]
  z_scores[i] <- summary(models)$coefficients[3]
  p_value[i] <- summary(models)$coefficients[4]
}
Would this be the best approach?
How is the probability of the binomial distribution estimated here?
Note that, regardless of the number of successes and failures, my error is extremely high whenever either s or f is 0.

Here is some code that recomputes most of your results (except the extreme ones caused by zeros) without using glm, followed by an explanation of what they mean.
s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure
# for each observation I would calculate this error:
error <- vector()
z_scores <- vector()
p_value <- vector()
for (i in 1:7) {
  models <- glm(cbind(s[i], f[i]) ~ 1, family = "binomial")
  error[i] <- summary(models)$coefficients[2]
  z_scores[i] <- summary(models)$coefficients[3]
  p_value[i] <- summary(models)$coefficients[4]
}
logit <- function(x) {
  log(x / (1 - x))
}
dlogit <- function(x) {
  1 / (x * (1 - x))
}
p_hat <- s / (s + f)
## sqrt(p_hat * (1 - p_hat) / (s + f)) is the standard error of p_hat;
## error1 is the standard error of logit(p_hat)
error1 <- dlogit(p_hat) * sqrt(p_hat * (1 - p_hat) / (s + f))
## dividing the estimate by its standard error gives the z-score
z_scores1 <- logit(p_hat) / error1
p_value1 <- 2 * pnorm(-abs(z_scores1))
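To check that these reproduce the glm output (apart from the zero-count cases), you can put the two sets of results side by side; this small comparison is my addition, assuming the vectors computed above are still in the workspace:
round(cbind(error, error1, z_scores, z_scores1, p_value, p_value1), 4)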
The first thing you need to know is the rationale behind the standard error, z-score and p-value. In statistics, we start with a model (in this case a binomial model: s | (s + f) ~ Binomial(s + f, p)), fit it to the data we have, and then
1) obtain estimates (of p in this case);
2) since the data are generated randomly, quantify how good the estimate is. That is what the standard error, z-score and p-value do: they "measure the randomness in the estimation". The important "trick" is that, because we do not know the true mechanism that generates the data, we can only approximate the randomness in our estimate by assuming that
a) our model is (or is close to) the true data-generating mechanism, and
b) the true parameter is close to our estimate. This often requires a large sample size; here the sample size is just s + f, so s + f must be large enough for the inference (standard error, z-score and p-value) to be valid. We can see that in cases i = 1, 6 and 7 the sample size is very small, which makes the corresponding standard errors, z-scores and p-values unreliable.
Now for the technical details behind my calculations and what they mean. In glm, on top of the Binomial(n, p) model, everything is done on the logit scale, and the estimate of logit(p) is treated as approximately normal:
logit(p_hat) ~ N(logit(p), sigma^2)
where the logit function is the same one defined in my code.
In this simple case the estimate of the binomial probability p is just p_hat <- s / (s + f) (with or without glm). From the variance formula for a binomial variable, the variance of the estimated probability is p * (1 - p) / n; if, by assumption b), we take p_hat <- s / (s + f) as a stand-in for the true p, we get the standard error of the estimated p. By the CLT and the delta method, when the sample size is large enough we can treat s / (s + f), or logit(s / (s + f)), as approximately normal: s / (s + f) is approximately N(p, s * f / (s + f)^3) and logit(s / (s + f)) is approximately N(logit(p), dlogit(s / (s + f))^2 * s * f / (s + f)^3).
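As a quick sanity check (my own addition, not part of the original calculation), you can simulate from a binomial with p set equal to p_hat for one of the larger samples, say i = 4 (s = 40, f = 4), and compare the empirical spread of logit(p_hat) with the delta-method standard error; the two will not match exactly because n * p * (1 - p) is not huge here, but they should be of similar size. This uses the dlogit function defined above.
set.seed(1)
n <- 44; p <- 40 / 44                  # plug in p_hat as if it were the true p (assumption b)
sim_s <- rbinom(10000, n, p)
sim_s <- sim_s[sim_s > 0 & sim_s < n]  # drop draws where the logit would be infinite
sd(log(sim_s / (n - sim_s)))           # empirical standard deviation of logit(p_hat)
dlogit(p) * sqrt(p * (1 - p) / n)      # delta-method standard error (about 0.52)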
Simply put, the standard errors, z-scores and p-values that glm calculates are just those for logit(s / (s + f)), and they are computed against the null hypothesis logit(p) = 0, in other words p = 0.5. So the z-scores and p-values obtained from glm test whether s and f occur with equal probability, provided the sample size s + f is large.
Now for the extreme values caused by zeros. When s or f equals 0, the estimated probability of the other outcome is 1; if that were literally true, the data-generating mechanism would actually be non-random! As I said at the beginning, we use our estimates to approximate the randomness in our estimates, and when s or f equals 0, taking the estimate as the ground truth means trusting it with 100% certainty, which is rather absurd. In such cases many methods, glm included, are not valid. Generally speaking, if the sample size s + f is large, we can believe that the probability of s (or f) occurring is really small when s = 0 (or f = 0); but if the sample size is very small, as in cases 6 and 7, we cannot reach any conclusion.
In sum, if the binomial model is true, then from the glm results, my code and the analysis above, we can say that in cases i = 2, 3, 4 and 5 the probabilities of s and f differ significantly.
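As a side note that goes beyond the glm output above: for the zero-count (and other small-sample) cases you can still get an interval for p from an exact binomial test, whose Clopper-Pearson confidence interval does not rely on the large-sample normal approximation. A minimal sketch, assuming s and f as defined above:
exact_ci <- t(sapply(1:7, function(i) binom.test(s[i], s[i] + f[i])$conf.int))
colnames(exact_ci) <- c("lower", "upper")
cbind(p_hat = s / (s + f), exact_ci)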

Related

Singular gradient in nls model - how can I make it work?

I have data on which I want to fit a non-linear regression model. The model is a physical model for computing the chloride diffusion coefficient. In my case the model looks like
Cx = Ci + (Cs - Ci) * erfc(x / sqrt(4 * D * t))
with Ci = 0.020664, t = 28/365, x and Cx given in the data, and Cs and D the coefficients to be estimated. erfc is the complementary error function.
I have data in form of
data = data.frame(x = c(2.13, 4.38, 6.38, 8.13, 10.38, 13.88, 17.38),
                  Cx = c(0.085, 0.017, 0.011, 0.010, 0.009, 0.010, 0.009))
So what I coded in R was
erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1  # error function
erfc <- function(x) 1 - erf(x)                 # complementary error function
m1 <- nls(formula = Cx ~ 0.020664 + (Cs - 0.020664) *
            erfc(x / (sqrt(4 * D * (28/365)))),
          data = data,
          start = list(Cs = 0.5, D = 50))
This gives me the error message "singular gradient". Since the data are already given and I can't really change the model either, does anybody have an idea how to solve this?
(I saw that it is often recommended to use a different library than nls when this error occurs, but the ones I tried (e.g. nlsr) could not differentiate the erfc function.)
Since erfc > 0, if Cs - 0.020664 is positive then the second term is positive, so the entire right-hand side will be above 0.020664, whereas all the points except the first are below it. If Cs - 0.020664 is negative, then all fitted values will be below 0.020664, and the fit will be far from the first point.
Thus it is not a matter of finding a different algorithm: your model simply does not fit the data.
As a general comment, simply looking for a different algorithm when nls fails is often a poor strategy, since the situation we have here is frequently the case. A better strategy is to understand the model better and improve it.
If we modify the model slightly, as a linear combination of 1 and erfc, then we can get a good fit. The linear combination coefficients are .lin.A and .lin.B and do not need starting values when using the plinear algorithm. The algorithm converges in only 3 iterations, which, together with the plot below, shows that it fits the data well. Note that this revised model still has the parameter D, so if that is the main parameter of interest you can still use this model.
In the plot below the circles are the data points and the line is the fitted values at them.
fm <- nls(formula = Cx ~ cbind(A = 1, B = erfc(x / sqrt(4 * D * 28/365))),
          data = data, start = list(D = 25), algorithm = "plinear")
fm
## Nonlinear regression model
## model: Cx ~ cbind(A = 1, B = erfc(x/sqrt(4 * D * 28/365)))
## data: data
## D .lin.A .lin.B
## 25.932426 0.009704 0.263631
## residual sum-of-squares: 2.041e-06
##
## Number of iterations to convergence: 3
## Achieved convergence tolerance: 6.247e-06
plot(Cx ~ x, data)
lines(fitted(fm) ~ x, data)
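If D is the main quantity of interest it can be pulled straight out of the fit; the short sketch below is my addition and also shows the two linear coefficients. Note that the fitted baseline .lin.A is well below Ci = 0.020664, which is another way of seeing why the original parameterization cannot fit these data.
co <- coef(fm)
co[["D"]]       # diffusion coefficient from the modified model
co[[".lin.A"]]  # fitted baseline level (compare with Ci = 0.020664)
co[[".lin.B"]]  # coefficient of the erfc term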

R: Least squares estimation in linear equation systems

I'm going to estimate some parameters of linear equation systems with repeated measures. My equations will look like this:
Variant1:
Variant2:
At least 10 values (repeated measures; technical replicates) are known for every and . I want to estimate the values for and resp. .
Additionally I'd like to know the standard error of these estimates, if possible.
In R, my data set would look like this (in reality I have :
i <- rep(1:3, each = 30)
j <- rep(rep(1:3, each = 10), 3)
K.i <- rep(c(6, 5, 10), each = 30) + rnorm(90)
K.ij <- K.i + rnorm(90)
# X_i, X_ij and x_ij should be 0 (since I assumed K_ij being K_i + normal noise)
data <- cbind(i, j, K.i, K.ij)
How can I estimate the expected parameter values (minimizing the sums of squares) and the standard errors of these estimates in R?
Thanks a lot in advance for your help!

R: Confidence intervals on non-linear fit with a non-analytic model

I need to fit x-y data with a model, which is non-analytic. I have a function f(x) that calculates the model for each x numerically, but there is no analytical equation. For the fit, I use optim in R. I minimise RMS between the model and the data. It works well and returns reasonable parameters.
I would like to find confidence intervals (or at least standard errors) on the best-fitting parameters. I found on the internet that this can be done from the Hessian matrix, but only when maximising a log-likelihood function. I don't know how to do that; all I have is x, y and f(x), from which I compute the RMS. Alas, I have no good way of estimating errors on y.
How can I find confidence intervals on my fit parameters?
Edit: perhaps an example in R might help explain what I'm asking for. This example uses a simple analytic function to fit the data; in my real case the function is non-analytic, so I cannot use, e.g., nls.
set.seed(666)
# generate data
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  err <- sqrt(sum((f(x, a, b) - y)^2))
}
# use optim to fit the model to the data
par <- c(1, 0)
res <- optim(par, errfun, gr = NULL, x, y)
# best-fitting parameters
best_a <- res$par[1]
best_b <- res$par[2]
The best fitting parameters are a = 0.50 and b = 0.20. I need to find 95% confidence intervals on these.
This is a job for the bootstrap:
(1) Create a large number of synthetic datasets x*. These are created by sampling from x with replacement, drawing the same number of data points as were in x. For example, if your data are (1,2,3,4,5,6), an x* might be (5,2,4,4,2,3) (note that values may appear multiple times, or not at all, because we are sampling with replacement).
(2) For each x*, calculate f(x*). If there are other parameters which don't depend on the data, don't change them (so f(x, a, b, c) becomes f(x*, a, b, c) as long as a, b, c don't depend on x). Call these quantities f*.
(3) You can estimate anything you want from these f*. If you want the standard deviation of f(x), take the standard deviation of f*. If you want the 95% confidence interval, take the range from the 2.5 to the 97.5 percentiles of f*. More formally, if you want to estimate g(f(x)) you estimate it as g(f(x*)).
I should say this is a very practically-oriented explanation of the bootstrap. I have glossed over many theoretical details, but the bootstrap is near-universally applicable (basically as long as the thing you are trying to estimate actually exists, you are usually okay).
To apply this to the example you have given in your code:
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  err <- sqrt(sum((f(x, a, b) - y)^2))
}
# this is the part where we bootstrap
# use optim to fit the model to the data
best_a <- best_b <- numeric(10000)
for (i in 1:10000) {
  j <- sample(100, replace = TRUE)
  x.boot <- x[j]; y.boot <- y[j]
  par <- c(1, 0)
  res <- optim(par, errfun, gr = NULL, x.boot, y.boot)
  # best-fitting parameters
  best_a[i] <- res$par[1]
  best_b[i] <- res$par[2]
}
# now, we look at the *vector* best_a
# for example, if you want the standard deviation of a,
sd(best_a)
# or a 95% confidence interval for b,
quantile(best_b, c(0.025, 0.975))
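To wrap the output up (a small follow-up of my own, assuming the objects above are still in the workspace), you can report the point estimates from a fit to the original data alongside the percentile intervals:
res_full <- optim(c(1, 0), errfun, gr = NULL, x, y)  # fit on the original (non-resampled) data
rbind(a = c(estimate = res_full$par[1], quantile(best_a, c(0.025, 0.975))),
      b = c(estimate = res_full$par[2], quantile(best_b, c(0.025, 0.975))))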

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
  for (j in 1:nrow(data)){ # Each time we go through each example.
    if (h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j, 4]) * data[j, 1:3]) # We + or - the example from theta.
    } else {
      theta = theta # Else we let it be.
    }
  }
  h = theta %*% t(data[, 1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the accuracy calculated with the glm() logistic regression: mean(sign(predict(fit)) == data[,4]) gives 89%. Logically, there is no way to classify all of the points linearly, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy shows that ~90% is reached after just 1 iteration:
Since this is in line with the training classification performance of the logistic regression, the code is probably not conceptually wrong.
QUESTION: Is it OK to get coefficients so different from those of the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial distribution and a logit link, with a sigmoid-based cost function. The coefficients are only meaningful in that context. You've also got an intercept term in the logit model.
None of this is true for the perceptron model, so the interpretation of the coefficients is totally different.
Now, that's not saying anything about which model is better. There are no comparable performance metrics in your question that would allow us to determine that; to do so you should use cross-validation, or at least a holdout sample.
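One way to make the comparison concrete (a sketch of my own, not part of the answer above) is to refit the logistic regression on the same scaled inputs the perceptron used and compare coefficient directions (unit vectors) rather than raw magnitudes, since rescaling a coefficient vector does not move either model's decision boundary. This assumes the data matrix and theta from the perceptron code are still in the workspace.
dat_scaled <- as.data.frame(data)         # bias, test1, test2, y from the perceptron code (y coded -1/1)
dat_scaled$y01 <- (dat_scaled$y + 1) / 2  # recode y to 0/1 for glm
fit_scaled <- glm(y01 ~ test1 + test2, family = "binomial", data = dat_scaled)
unit <- function(v) v / sqrt(sum(v^2))
rbind(perceptron = unit(as.numeric(theta)),
      logistic = unit(as.numeric(coef(fit_scaled))))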

random effects variance of intercept being zero

I am running a power analysis using a normal LMM in R. I have seven input parameters, two of which I do not need to test for (no. of years and no. of sites). The other five parameters are the intercept, the slope, and the standard deviations of the residual, the random intercept and the random slope.
Given that my response data (year is the sole explanatory variable in the model) are bounded between (-1, +1), the intercept also falls in this range. However, what I am finding is that if I run, say, 1000 simulations with a given intercept and slope (which I am treating as constant over 10 years), then once the random-effects intercept SD falls below a certain value there are many simulations in which the estimated random-effects intercept SD is zero. If I inflate the intercept SD, the simulations recover it correctly (see below, where I use residual SD = 0.25, intercept SD = 0.10 and slope SD = 0.05; if I increase the intercept SD to 0.2 it is recovered correctly, and if I drop the residual SD to, say, 0.05, the variance parameters are also recovered correctly).
Is this problem due to my coercion of the range to be (-1, +1)?
I include the code for my function and the processing of the simulations below, if this would help:
Function: generating the data:
normaldata <- function(J, K, beta0, beta1, sigma_resid,
                       sigma_beta0, sigma_beta1) {
  year <- rep(rep(0:J), K)        # 0:J replicated K times
  site <- rep(1:K, each = (J+1))  # 1:K sites, each repeated for J+1 years
  mu.beta0_true <- beta0
  mu.beta1_true <- beta1
  # random effects variance parameters:
  sigma_resid_true <- sigma_resid
  sigma_beta0_true <- sigma_beta0
  sigma_beta1_true <- sigma_beta1
  # site-level parameters:
  beta0_true <<- rnorm(K, mu.beta0_true, sigma_beta0_true)
  beta1_true <<- rnorm(K, mu.beta1_true, sigma_beta1_true)
  # data
  y <<- rnorm(n = (J+1)*K, mean = beta0_true[site] + beta1_true[site]*year,
              sd = sigma_resid_true)
  # NOT SURE WHETHER TO IMPOSE THE LIMITS HERE OR LATER IN CODE:
  y[y < -1] <- -1 # absolute minimum
  y[y > 1] <- 1   # absolute maximum
  return(data.frame(y, year, site))
}
Processing the simulations:
vc1 <- as.data.frame(VarCorr(lme.power))
vc2 <- as.numeric(attributes(VarCorr(lme.power)$site)$stddev)
n.sims = 1000
sigma.resid <- rep(0, n.sims)
sigma.intercept <- rep(0, n.sims)
sigma.slope <- rep(0,n.sims)
intercept <- rep(0,n.sims)
slope <- rep(0,n.sims)
signif <- rep(0,n.sims)
for (s in 1:n.sims){
  y.data <- normaldata(10, 200, 0.30, ((0-0.30)/10), 0.25, 0.1, 0.05)
  lme.power <- lmer(y ~ year + (1+year | site), data=y.data)
  summary(lme.power)
  theta.hat <- fixef(lme.power)[["year"]]
  theta.se <- se.fixef(lme.power)[["year"]]
  signif[s] <- ((theta.hat + 1.96*theta.se) < 0) |
               ((theta.hat - 1.96*theta.se) > 0) # returns TRUE or FALSE
  signif[s]
  betas <- fixef(lme.power)
  intercept[s] <- betas[1]
  slope[s] <- betas[2]
  vc1 <- as.data.frame(VarCorr(lme.power))
  vc2 <- as.numeric(attributes(VarCorr(lme.power)$site)$stddev)
  sigma.resid[s] <- vc1[4,5]
  sigma.intercept[s] <- vc2[1]
  sigma.slope[s] <- vc2[2]
  cat(paste(s, " ")); flush.console()
}
power <- mean (signif) # proportion of TRUE
power
summary(sigma.resid)
summary(sigma.intercept)
summary(sigma.slope)
summary(intercept)
summary(slope)
Thank you in advance for any help you can offer.
This is really more of a statistical than a computational question, but the short answer is: you haven't made any mistakes, this is exactly as expected. This example on rpubs runs some simulations of a Normally distributed response (i.e. it corresponds exactly to the model assumed by LMM software, so the constraint you're worried about isn't an issue).
The left-hand histogram below is from simulations with 25 samples in 5 groups, with equal variance (of 1) within and between groups; the right-hand histogram is from simulations with 15 samples in 3 groups.
The sampling distribution of variance estimates for null cases (i.e., no real between-group variation) is known to have a point mass or "spike" at zero; it is not surprising (although, as far as I know, not theoretically worked out) that the sampling distribution of the variance estimates also has a point mass at zero when the between-group variance is non-zero but small, and/or when the sample is small and/or noisy.
http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-variance has more on this topic.
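Here is a minimal sketch of my own (not taken from the linked pages) that reproduces the spike at zero with lme4: simulate small grouped Gaussian data with a modest between-group SD and count how often the estimated random-intercept SD comes out exactly zero (these are the fits lme4 reports as singular).
library(lme4)
set.seed(101)
n.check <- 200
sd.int <- numeric(n.check)
for (k in 1:n.check) {
  d <- data.frame(site = factor(rep(1:5, each = 5)))
  b <- rnorm(5, sd = 0.5)                        # true random intercepts, SD = 0.5
  d$y <- b[as.integer(d$site)] + rnorm(nrow(d))  # residual SD = 1
  fit <- lmer(y ~ 1 + (1 | site), data = d)
  sd.int[k] <- as.data.frame(VarCorr(fit))$sdcor[1]
}
mean(sd.int == 0)  # a noticeable fraction of fits estimate the SD as exactly zero
hist(sd.int, breaks = 30)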
