Calculating the confidence intervals between two approaches - math

I'm trying to do something like what is described in the "VSUMM" article.
There are 5 approaches whose results I want to compare with mine. I compare all of them in a table by calculating CUS(a), i.e. the mean accuracy rate, and CUS(e), i.e. the mean error rate, for each of them. (It's a video summarization problem; each approach produces summaries for 50 videos. These summaries are compared against five user summaries, and CUS(a) and CUS(e) are then calculated from those results.)
Now I need to calculate the confidence intervals (at the 98% confidence level) of each of those approaches against mine.
I surfed the web and found long articles about confidence intervals, but I did not really understand any of them.
What should I do?

Finally, I found the answer:
For calculating the confidence intervals between your approach and the other approaches, you of course need the results of your algorithm and of those approaches. Then you can calculate the mean and standard deviation of these data and use the confidence interval formula:
(amean - bmean) +/- z(1-a/2) * sqrt(stda^2/n1 + stdb^2/n2)
where stda is the standard deviation of group a (stdb of group b), n1 is the number of data points in the first group (n2 in the second), and amean is the mean of the first group, i.e. the mean of the results of the first approach (bmean is defined analogously for the second approach).
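As a minimal sketch in R of that formula (the per-video scores below are made-up placeholders; substitute the actual CUS(a) results of the two approaches):
set.seed(1)
a <- rnorm(50, mean = 0.85, sd = 0.05)  # hypothetical per-video scores, your approach
b <- rnorm(50, mean = 0.80, sd = 0.07)  # hypothetical per-video scores, a competing approach
conf <- 0.98
z <- qnorm(1 - (1 - conf) / 2)          # z(1-a/2), about 2.33 for 98% confidence
diff <- mean(a) - mean(b)
se <- sqrt(var(a) / length(a) + var(b) / length(b))
c(lower = diff - z * se, upper = diff + z * se)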

Related

99% confidence interval, proportion

Maybe a dumb question, but I just started, so thanks for any help.
I created a 99% confidence interval for a proportion, but I'm not sure if it is correct. How can I make sure? (When we calculate a confidence interval for a mean, we use a t-score, and we can check the result with the t.test function and its degrees of freedom.)
Is there a similar function that does the same thing for z and proportions, or can I do the same thing with t.test?
There are a number of functions in R for computing confidence intervals for proportions. Personally, I like to use the binomCI function from the MKinfer package:
install.packages("MKinfer")
library(MKinfer)
x <- 50 # The number of "successes" in your data
n <- 100 # The number of observations in your data
binomCI(x, n, conf.level = 0.99, method = "wald")
Note, however, that the so-called Wald interval (the one presented in most introductory statistics texts, and probably the one you computed) is usually a poor choice. See this link for some of the other alternatives available through binomCI.
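For comparison, two base-R functions also return confidence intervals for a proportion, with no extra packages: prop.test gives a score (Wilson-type) interval with continuity correction, and binom.test an exact Clopper-Pearson interval:
x <- 50
n <- 100
prop.test(x, n, conf.level = 0.99)$conf.int   # score interval with continuity correction
binom.test(x, n, conf.level = 0.99)$conf.int  # exact Clopper-Pearson interval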

Calculate basic statistics in R

I am a noob when using R.
My experiment: I have 300 genotypes, each one planted in 6 different locations. For every genotype in every location I have a measure of the yield.
What I would like to do: I would like to calculate the mean, standard deviation and standard error for every genotype, first using the yield data of the 6 locations. Later, I want to calculate the same statistical parameters for only 5 locations and then 4 locations.
This is an example of my desired output:
I have been searching for days, but I cannot find how to do it.
Let's say this is your data:
library(data.table)
dt <- data.table(genotype = sample(1:10, size = 20, replace = TRUE),
                 location = sample(1:6, size = 20, replace = TRUE),
                 yield    = round(runif(20, 1000, 1500)))
Then the first thing to do is to take the mean of yield, by genotype:
m1 = dt[, .(mean_6_locations = mean(yield)), by = genotype]
After that, assuming that you know which locations to exclude, here are the means over 5 and 4 locations respectively:
m2 = dt[!location %in% c(6), .(mean_5_locations = mean(yield)), by = genotype]
m3 = dt[!location %in% c(5, 6), .(mean_4_locations = mean(yield)), by = genotype]
Note that location 6 is excluded for the mean of 5 locations; similarly, locations 5 and 6 are excluded for the mean of 4 locations.
Lastly, you need to merge everything into a single table:
m12 = merge(m1,m2)
m123 = merge(m12,m3)
print(m123)
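The question also asks for the standard deviation and standard error; one way to get them in the same data.table call (taking the standard error of the mean to be sd/sqrt(n), which is an assumption about what is wanted) would be:
stats_6 <- dt[, .(mean_yield = mean(yield),
                  sd_yield   = sd(yield),
                  se_yield   = sd(yield) / sqrt(.N)),  # standard error of the mean
              by = genotype]
print(stats_6)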
This is an interesting problem, and I would approach it with Monte Carlo-like methods. I would definitely encourage nonparametric methods, because the dimensionality of the data doesn't support distributional assumptions.
Assume genotype doesn't matter and aggregate over the six locations (or 5, or 4) to build a distribution of means. The quintile into which a specific genotype's mean falls within that distribution (for the corresponding number of locations) tells you a lot more about the genotype than the mean itself. The standard error of the means also falls out of that distribution.
The standard deviation of this distribution similarly lets you know the standard deviation among means and allows for significance testing.
I know this answer is a little tangential, but building a distribution for six locations and taking the standard deviation of that doesn't tell you much.
Similarly, if you take the standard deviation of all rows and build a distribution of standard deviations, you can see how tight a given genotype's standard deviation is relative to the population, again just by using a quintile.
I assume the optimal genotype would be in a high quintile of the mean distribution and a low quintile of the standard deviation distribution, for a given location or across all locations, depending of course on the specific question being addressed.
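A rough sketch of that resampling idea in R, reusing the dt example data from the previous answer (the number of replicates and the use of sampling with replacement are arbitrary choices):
set.seed(1)
# Reference distribution of 6-location means, ignoring genotype
ref_means <- replicate(10000, mean(sample(dt$yield, 6, replace = TRUE)))
genotype_means <- dt[, .(mean_yield = mean(yield)), by = genotype]
# Where each genotype's mean falls within the reference distribution (0 = lowest, 1 = highest)
genotype_means[, position_in_ref := sapply(mean_yield, function(m) mean(ref_means <= m))]
print(genotype_means)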

Compare variances between two populations with different means

I would like to compare two populations which have different means. I want to find a way to compare their variances, to get an idea of which of the two populations has values that disperse further from the mean.
The issue is that I think I would need a variance standardized/normalized by the mean value of each distribution.
Suggestions?
The next step would be to find a function in R that is able to do that.
You don't need to standardise/normalise, because variance is calculated as distance from the mean, so it is already normalised around the sample mean.
To demonstrate this run the following code
x <- runif(10000, min = 100, max = 101)  # 10,000 draws from Uniform(100, 101)
y <- runif(10000, min = 1, max = 2)      # 10,000 draws from Uniform(1, 2)
mean(x)  # about 100.5
mean(y)  # about 1.5
var(x)   # about 1/12, i.e. 0.083
var(y)   # about 1/12, i.e. 0.083
You'll see that while the means are different, the variances of the two samples are essentially identical (allowing for some difference due to pseudo-random number generation and finite sample size).
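If a formal test of equal variances is also wanted, base R's var.test runs an F test for the ratio of two variances (note that it assumes approximately normal data, which the uniform samples above are not; it is shown here only as a usage sketch):
var.test(x, y)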

Adding (Mean +/- 2 Sigma) lines in box plot

I have a data set on which I would like to make a boxplot in R. I was able to do it simply (here I used just some sample data) with the following commands in the R console:
x <- rnorm(100);
y <- rnorm(100);
boxplot(x, y, col=rainbow(2))
Now what I really want is to show the mean +/- (2 or 3)*sigma marks for my data set on top of the above boxplot. I do know that the standard deviation and the mean are not robust statistics, but for my data set that information may be useful. The intended figure should look like this: the dotted lines (of different colors) represent each plot's 95% or 99.7% confidence interval, i.e. mean +/- (2 or 3)*sigma.
Something very simple and easy to do would be to use the notch=TRUE argument in the boxplot() function (see ?boxplot). This isn't quite the same as a 95% confidence interval for the mean, but it is instead analogous to a 95% CI for the median (which would address your concern about robustness). It is explained here:
The notches (if requested) extend to +/- 1.58 IQR/sqrt(n). This seems to be based on the same calculations as the formula with 1.57 in Chambers et al. (1983, p. 62), given in McGill et al. (1978, p. 16). They are based on asymptotic normality of the median and roughly equal sample sizes for the two medians being compared, and are said to be rather insensitive to the underlying distributions of the samples. The idea appears to be to give roughly a 95% confidence interval for the difference in two medians.
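A minimal sketch combining the notched boxplot with the mean +/- 2*sigma marks asked for in the question (the segment widths, line type, and colors are arbitrary choices):
set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
boxplot(x, y, col = rainbow(2), notch = TRUE)
means <- c(mean(x), mean(y))
sds   <- c(sd(x), sd(y))
# dashed lines at mean - 2*sd and mean + 2*sd over each box
segments(1:2 - 0.4, means - 2 * sds, 1:2 + 0.4, means - 2 * sds, lty = 2, col = rainbow(2))
segments(1:2 - 0.4, means + 2 * sds, 1:2 + 0.4, means + 2 * sds, lty = 2, col = rainbow(2))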

Likert Rank ordering optimization heuristic possible?

I can't find the type of problem I have, and I was wondering if someone knows what kind of statistics it involves. I'm not sure it's even a type that can be optimized.
I'd like to optimize three variables, or more precisely the combination of two of them. The first is a Likert-scale average, the second is the frequency with which that item was rated on that Likert scale, and the third is the item ID. The Likert scale is [1, 2, 3, 4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high value items. For instance, an item that is 4,1 would suck because while it is rated highest, it is rated only once. And a 1,1000 would also suck for the inverse reason.
Is there a way to optimize with a simple heuristic? Someone told me to look into confidence bands but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to produce that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data has standard deviation 0.5 and you want to be 95% sure your score is within 0.1 of the current estimate, then you need roughly (1.96 * 0.5 / 0.1)^2, about 96 ratings (dropping the 1.96 z-factor, (0.5/0.1)^2 = 25 ratings only brings the standard error down to 0.1, which corresponds to roughly 68% confidence).
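That rule of thumb is easy to compute directly in R (sigma and the margin E below are just the example values from above):
sigma <- 0.5                     # estimated standard deviation of the ratings
E     <- 0.1                     # desired margin of error
z     <- qnorm(0.975)            # about 1.96 for 95% confidence
n_min <- ceiling((z * sigma / E)^2)
n_min                            # 97 ratings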
