Binomial confidence intervals of means with R

I have four different data.frames with observations that follow a binomial distribution, and for each one I need to calculate the confidence interval for the mean of a second column (Flow).
The number of successes is reported in the column Success, and the total number of trials is 85.
How can I calculate these confidence intervals, and how can I do it with R?
Here is an example of my data.frames:
df1 <- read.table(text = 'Flow Success
725.661 4
25.54 4
318.481 4
230.556 4
2.823 3
12.6 3
9.891 3
11.553 1', header = TRUE)
> mean(df1$Flow)
[1] 167.1381
df2 <- read.table(text = 'Flow Success
725.661 3
25.54 3
318.481 3
230.556 2
2.823 2
12.6 1', header = TRUE)
> mean(df2$Flow)
[1] 219.2768
df3 <- read.table(text = 'Flow Success
725.661 2
25.54 2
318.481 1', header = TRUE)
> mean(df3$Flow)
[1] 356.5607
df4 <- read.table(text = 'Flow Success
725.661 2
25.54 2', header = TRUE)
> mean(df4$Flow)
[1] 375.6005
I need to calculate the confidence intervals of the above means.
I can give you more info about the data if needed.
Thanks to anyone who can help me.

The package binom provides methods for calculating binomial confidence intervals. One can choose to use all available methods, or specify a single method.
x gives the number of successes, and n the number of Bernoulli trials.
library(binom)
binom.confint(x = 5, n = 10)
method x n mean lower upper
1 agresti-coull 5 10 0.5 0.2365931 0.7634069
2 asymptotic 5 10 0.5 0.1901025 0.8098975
3 bayes 5 10 0.5 0.2235287 0.7764713
4 cloglog 5 10 0.5 0.1836056 0.7531741
5 exact 5 10 0.5 0.1870860 0.8129140
6 logit 5 10 0.5 0.2245073 0.7754927
7 probit 5 10 0.5 0.2186390 0.7813610
8 profile 5 10 0.5 0.2176597 0.7823403
9 lrt 5 10 0.5 0.2176212 0.7823788
10 prop.test 5 10 0.5 0.2365931 0.7634069
11 wilson 5 10 0.5 0.2365931 0.7634069
binom.confint(x = 5, n = 10, method = "exact")
method x n mean lower upper
1 exact 5 10 0.5 0.187086 0.812914
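Applied to the data frames in the question, a minimal sketch could look like this. It assumes, as the question states, that each data frame's successes are counted out of a single total of 85 trials; if each row instead has its own trial count, x and n would need to change accordingly.
library(binom)
# pooled successes in df1, out of the stated 85 trials
binom.confint(x = sum(df1$Success), n = 85, method = "exact")
# repeat for the remaining data frames in one pass
lapply(list(df1, df2, df3, df4),
       function(d) binom.confint(x = sum(d$Success), n = 85, method = "exact"))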

Friedman's test manual calculation

I have obtained a negative value from the Friedman test. The data is:
Full MIC ReliefF LCorrel InfoGain
equinox 69.939 80.178 78.794 75.205 62.268
lucene 78.175 84.103 79.017 82.044 75.564
mylyn 75.531 78.006 77.161 47.711 81.575
pde 70.282 82.686 81.884 75.07 79.476
jdt 71.675 93.202 95.387 85.878 82.818
Ranking is below
Full MIC ReliefF LCorrel InfoGain
equinox 2 5 4 3 1
lucene 2 5 3 4 1
mylyn 2 4 3 1 5
pde 1 5 4 2 3
jdt 1 4 5 3 2
Sum 8 23 19 13 12
The Friedman F calculation formula:
F = 5/[5*5*(5+1)] * [8*8 + 23*23 + 19*19 + 13*13 + 12*12] - [5*5*(5+1)]
The value I get is -107.7666667.
How do I interpret that? The examples I have seen all have positive result.
I know about the R code but want the manual calculation.
This is how I generated the results, and it worked:
library(PMCMR)  # provides posthoc.friedman.nemenyi.test()

pacc_part  # matrix of results: rows = datasets, columns = methods
f1 <- friedman.test(pacc_part)
print(f1)
# Post-hoc tests are conducted only if the omnibus Friedman test p-value
# is 0.05 or less.
if (f1$p.value < 0.05) {
  n1 <- posthoc.friedman.nemenyi.test(pacc_part)
  print(n1)
  # alternate representation of post-hoc test results
  print(summary(n1))
}
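For the manual calculation itself, the usual Friedman chi-square statistic is 12/[n*k*(k+1)] * (sum of squared rank sums) - 3*n*(k+1), with n blocks (here 5 datasets) and k treatments (here 5 methods). The leading constant is 12, not 5, and the subtracted term is 3*n*(k+1) rather than n*k*(k+1), which is why the formula in the question goes negative. A quick sketch with the rank sums from the table above:
# rank sums for Full, MIC, ReliefF, LCorrel, InfoGain
R <- c(8, 23, 19, 13, 12)
n <- 5; k <- 5
chisq <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)
chisq  # 11.36, positive as expected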

R is not ordering data correctly - skips e-notation values

I am trying to order data by the column weightFisher. However, it is almost as if R does not treat values in e notation (scientific notation) as small, because all of those values are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following:
resultTable_bon$weightFisher <- as.numeric(resultTable_bon$weightFisher)
Then:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]
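A minimal illustration of why converting to numeric matters, assuming weightFisher was stored as character (a common cause of this symptom): character vectors sort lexicographically rather than numerically.
p <- c("1.9e-12", "0.00014", "2.5e-07")
sort(p)              # lexicographic: "0.00014" "1.9e-12" "2.5e-07"
sort(as.numeric(p))  # numeric:       1.9e-12  2.5e-07  1.4e-04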

strucchange not reporting breakdates

This is my first time with strucchange so bear with me. The problem I'm having seems to be that strucchange doesn't recognize my time series correctly but I can't figure out why and haven't found an answer on the boards that deals with this. Here's a reproducible example:
require(strucchange)
# time series
nmreprosuccess <- c(0,0.50,NA,0.,NA,0.5,NA,0.50,0.375,0.53,0.846,0.44,1.0,0.285,
0.75,1,0.4,0.916,1,0.769,0.357)
dat.ts <- ts(nmreprosuccess, frequency=1, start=c(1996,1))
str(dat.ts)
Time-Series [1:21] from 1996 to 2016: 0 0.5 NA 0 NA 0.5 NA 0.5 0.375 0.53 ...
To me this means that the time series looks OK to work with.
# obtain breakpoints
bp.NMSuccess <- breakpoints(dat.ts~1)
summary(bp.NMSuccess)
Which gives:
Optimal (m+1)-segment partition:
Call:
breakpoints.formula(formula = dat.ts ~ 1)
Breakpoints at observation number:
m = 1 6
m = 2 3 7
m = 3 3 14 16
m = 4 3 7 14 16
m = 5 3 7 10 14 16
m = 6 3 7 10 12 14 16
m = 7 3 5 7 10 12 14 16
Corresponding to breakdates:
m = 1 0.333333333333333
m = 2 0.166666666666667 0.388888888888889
m = 3 0.166666666666667
m = 4 0.166666666666667 0.388888888888889
m = 5 0.166666666666667 0.388888888888889 0.555555555555556
m = 6 0.166666666666667 0.388888888888889 0.555555555555556 0.666666666666667
m = 7 0.166666666666667 0.277777777777778 0.388888888888889 0.555555555555556 0.666666666666667
m = 1
m = 2
m = 3 0.777777777777778 0.888888888888889
m = 4 0.777777777777778 0.888888888888889
m = 5 0.777777777777778 0.888888888888889
m = 6 0.777777777777778 0.888888888888889
m = 7 0.777777777777778 0.888888888888889
Fit:
m 0 1 2 3 4 5 6 7
RSS 1.6986 1.1253 0.9733 0.8984 0.7984 0.7581 0.7248 0.7226
BIC 14.3728 12.7421 15.9099 20.2490 23.9062 28.7555 33.7276 39.4522
Here's where I start having the problem. Instead of reporting the actual breakdates, it reports fractions, which makes it impossible to plot the break lines onto a graph because they're not at the breakdate (2002) but at 0.333.
plot.ts(dat.ts, main="Natural Mating")
lines(fitted(bp.NMSuccess, breaks = 1), col = 4, lwd = 1.5)
Nothing shows up for me in this graph (I think because it's so small for the scale of the graph).
In addition, when I try fixes that may possibly work around this problem,
fm1 <- lm(dat.ts ~ breakfactor(bp.NMSuccess, breaks = 1))
I get:
Error in model.frame.default(formula = dat.ts ~ breakfactor(bp.NMSuccess, :
variable lengths differ (found for 'breakfactor(bp.NMSuccess, breaks = 1)')
I get errors because of the NA values in the data: the length of dat.ts is 21, while the length of breakfactor(bp.NMSuccess, breaks = 1) is 18 (the 3 NAs are omitted).
Any suggestions?
The problem occurs because breakpoints() currently can only (a) cope with NAs by omitting them, and (b) cope with times/dates through the ts class. This creates the conflict: when you omit internal NAs from a ts it loses its ts property, and hence breakpoints() cannot infer the correct times.
The "obvious" way around this would be to use a time series class that can cope with this, namely zoo. However, I never got around to fully integrating zoo support into breakpoints() because it would likely break some of the current behavior.
To cut a long story short: your best choice at the moment is to do the bookkeeping for the times yourself and not expect breakpoints() to do it for you. The additional work is not huge. First, we create a data frame with the response and the time vector and omit the NAs:
d <- na.omit(data.frame(success = nmreprosuccess, time = 1996:2016))
d
## success time
## 1 0.000 1996
## 2 0.500 1997
## 4 0.000 1999
## 6 0.500 2001
## 8 0.500 2003
## 9 0.375 2004
## 10 0.530 2005
## 11 0.846 2006
## 12 0.440 2007
## 13 1.000 2008
## 14 0.285 2009
## 15 0.750 2010
## 16 1.000 2011
## 17 0.400 2012
## 18 0.916 2013
## 19 1.000 2014
## 20 0.769 2015
## 21 0.357 2016
Then we can estimate the breakpoint(s) and afterwards transform from the "number" of observations back to the time scale. Note that I'm setting the minimal segment size h explicitly here because the default of 15% is probably somewhat small for this short series. 4 is still small but possibly enough for estimating a constant mean.
bp <- breakpoints(success ~ 1, data = d, h = 4)
bp
## Optimal 2-segment partition:
##
## Call:
## breakpoints.formula(formula = success ~ 1, h = 4, data = d)
##
## Breakpoints at observation number:
## 6
##
## Corresponding to breakdates:
## 0.3333333
We ignore the break "date" at 1/3 of the observations and simply map back to the original time scale:
d$time[bp$breakpoints]
## [1] 2004
To re-estimate the model with nicely formatted factor levels, we could do:
lab <- c(
paste(d$time[c(1, bp$breakpoints)], collapse = "-"),
paste(d$time[c(bp$breakpoints + 1, nrow(d))], collapse = "-")
)
d$seg <- breakfactor(bp, labels = lab)
lm(success ~ 0 + seg, data = d)
## Call:
## lm(formula = success ~ 0 + seg, data = d)
##
## Coefficients:
## seg1996-2004 seg2005-2016
## 0.3125 0.6911
Or for visualization:
plot(success ~ time, data = d, type = "b")
lines(fitted(bp) ~ time, data = d, col = 4, lwd = 2)
abline(v = d$time[bp$breakpoints], lty = 2)
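If a confidence interval for the break location is also wanted, strucchange's confint() method reports it in observation numbers, which can be mapped back to years in the same way. A sketch using the bp object from above (for a series this short the interval may be quite wide):
ci <- confint(bp, breaks = 1)
ci$confint               # lower bound / breakpoint / upper bound as observation numbers
d$time[ci$confint[1, ]]  # the same three positions expressed as calendar years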
One final remark: For such short time series where just a simple shift in the mean is needed, one could also consider conditional inference (aka permutation tests) rather than the asymptotic inference employed in strucchange. The coin package provides the maxstat_test() function exactly for this purpose (= short series where a single shift in the mean is tested).
library("coin")
maxstat_test(success ~ time, data = d, dist = approximate(99999))
## Approximative Generalized Maximally Selected Statistics
##
## data: success by time
## maxT = 2.3953, p-value = 0.09382
## alternative hypothesis: two.sided
## sample estimates:
## "best" cutpoint: <= 2004
This finds the same breakpoint and provides a permutation test p-value. If, however, one has more data and needs multiple breakpoints and/or further regression coefficients, then strucchange would be needed.

covariance structure for multilevel modelling

I have a multilevel repeated-measures dataset of around 300 patients, each with up to 10 repeated measures, predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random-slope, random-intercept model where effects vary between patients and the effect of time differs between patients. When I try to introduce a first-order autoregressive covariance structure to allow for the correlation of measurements over time, I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
library(nlme)  # lme() and corAR1() come from nlme

# baseline model: fixed intercept only; the random intercept varies across patients
randomintercept <- lme(troponin ~ 1,
                       data = df, random = ~ 1 | record_id, method = "ML",
                       na.action = na.exclude,
                       control = list(opt = "optim"))
# random intercept with time (day) as a fixed effect
timeri <- update(randomintercept, . ~ . + day)
# random slopes and intercepts: the effect of time differs between patients
timers <- update(timeri, random = ~ day | record_id)
# covariance structure: corAR1() first-order autoregressive, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~ day | record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
                  correlation = corAR1(0, form = ~ day | record_id),
                  control = list(opt = "nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
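A short sketch of how one might inspect those estimates (assuming the armodel fit above converged):
summary(armodel)  # the AR(1) parameter (Phi) appears under 'Correlation Structure'
VarCorr(armodel)  # random-effect standard deviations and the intercept-slope correlation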
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions of one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot for the original fit, plot(randomintercept), makes this clear (figure not reproduced here), whereas fitting on the log scale
rlog <- update(randomintercept, log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
                  random = ~ day | record_id,
                  correlation = corAR1(0, form = ~ day | record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
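One way to formalise that judgement is to refit without the AR(1) term and compare the two fits with a likelihood ratio test; a sketch, assuming the ML-fitted ar.rlog from above:
rs.rlog <- update(ar.rlog, correlation = NULL)  # same model without the AR(1) structure
anova(rs.rlog, ar.rlog)                         # likelihood ratio test for the AR(1) term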
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(rlog, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
