I have two sets of statistics generated from processing. The processing can produce a large amount of results, so I would rather not store all of the raw data just to recalculate summary statistics later on.
Say I have two sets of statistics that describe two different sessions of runs over a process.
Each set contains
Statistics: { mean, median, standard deviation, number of runs }
How would I merge the two sets' medians and standard deviations to get a combined summary describing both sessions?
Remember, I can't preserve both sets of data that the statistics are describing.
Artelius is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:
new_var=(n(0)*(var(0)+(mean(0)-new_mean)**2) + n(1)*(var(1)+(mean(1)-new_mean)**2) + ...)/new_n
Edit (from comment):
The problem with the original formula is that if your deviation is small compared to your mean, you end up subtracting a large number from a large number to get a relatively small one, which costs you floating-point precision. The new formula avoids this problem; rather than converting to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample sizes.
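For concreteness, here is a minimal R sketch of that merge (the function and argument names are my own, not from the post); it combines any number of (count, mean, variance) summaries using the weighting above:

# Merge per-set summaries: n = counts, m = means, v = variances.
# Sketch only; assumes population-style variances (divide by n, not n-1).
merge_stats <- function(n, m, v) {
  new_n    <- sum(n)
  new_mean <- sum(n * m) / new_n
  new_var  <- sum(n * (v + (m - new_mean)^2)) / new_n
  c(n = new_n, mean = new_mean, var = new_var, sd = sqrt(new_var))
}

# Example: two sessions summarised as (runs, mean, variance)
merge_stats(n = c(1000, 250), m = c(10.2, 9.8), v = c(4.0, 3.5))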
You can get the mean and standard deviation, but not the median.
new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n
new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2
where n(0) is the number of runs in the first data set, n(1) is the number of runs in the second, and so on, mean is the mean, and var is the variance (which is just standard deviation squared). n**2 means "n squared".
Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,
Var(X) = E(X^2) - E(X)^2
The var(n)+mean(n)**2 terms above give us the E(X^2) portion which we can then combine with other data sets, and then get the desired result.
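A direct R transcription of these formulas (the function name is mine; note the other answer's warning that this form can lose precision when the means are large relative to the spread):

# Combine (count, mean, variance) summaries via E(X^2) = var + mean^2.
combine_naive <- function(n, m, v) {
  new_n    <- sum(n)
  new_mean <- sum(m * n) / new_n
  new_var  <- sum((v + m^2) * n) / new_n - new_mean^2
  c(n = new_n, mean = new_mean, sd = sqrt(new_var))
}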
In terms of medians:
If you are combining exactly two data sets, then you can be certain that the combined median lies somewhere between the two medians (or is equal to one of them), but there is little more that you can say. Taking their average should be OK, unless you need the reported median to be equal to an actual data point.
If you are combining many data sets in one go, you can either take the median of the medians or take their average. If there may be significant systematic differences between the different data sets, then taking their average is probably better, as taking the median reduces the effect of outliers. But if you have systematic differences between runs, disregarding them is probably not a good thing to do.
An exact median is not possible. Say you have two samples, (1, 1, 1, 2) and (0, 0, 2, 3, 3). Their medians are 1 and 2, but the overall median is 1, and there is no way to tell that from the two summaries alone.
I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis on an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range that goes from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
Firstly I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out (using plot(x)) that the values were not normally distributed.
So I searched for another R command and found this, but I couldn't work out how to set the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of data that is normally distributed, in a range of 0-70, with a mean of 48 and a not-so-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method is to draw more samples than you need and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
(The upper bound of 70 is closer to your mean than 0 is, so the probability of observations above 70 is larger than the probability of observations below 0, but it is still small enough that the truncation step rarely changes anything.)
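If you prefer the draw-and-reject approach mentioned above to pmin/pmax clipping, here is a minimal sketch (the function name and the oversampling factor are my own choices, not a standard API):

# Keep drawing until we have n values inside [lower, upper]; rejections are
# extremely rare for these parameters, so this loop almost never repeats.
rnorm_bounded <- function(n, mean, sd, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(2 * n, mean = mean, sd = sd)          # oversample a bit
    out  <- c(out, draw[draw >= lower & draw <= upper]) # keep in-range values
  }
  out[1:n]
}

ex1 <- round(rnorm_bounded(100, mean = 48, sd = 5, lower = 0, upper = 70))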
I would like to compare two populations which have different means. I want to find a way to compare their variances, to have an idea of which of the two populations have values that disperse further from the mean.
The issue is that I think I need a variance that is standardised/normalised with respect to the mean value of each distribution.
Suggestions?
The next step would be to get a function in R that is able to do that.
You don't need to standardise/normalise, because variance is calculated from the distances to the sample mean, so it is already centred on each sample's own mean.
To demonstrate this, run the following code:
x<-runif(10000,min=100,max=101)
y<-runif(10000,min=1,max=2)
mean(x)
mean(y)
var(x)
var(y)
You'll see that while the means are very different, the variances of the two samples are essentially identical (allowing for some difference due to pseudo-random number generation and sample size).
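Since the question asks for an R function, a trivial wrapper over the same idea (the name is mine):

# Compare the spread of two samples directly via their variances.
compare_spread <- function(x, y) {
  c(var_x = var(x), var_y = var(y), ratio = var(x) / var(y))
}

compare_spread(x, y)  # using the x and y simulated above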
I have a parallel application in which each node computes the variance of its partition of the data points based on that partition's mean, but how can I compute the global variance?
I thought it would be a simple sum of the variances divided by the number of nodes, but that is not giving me a result close to the true value...
The global variation (the sum of squared deviations) is a sum.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
In the same way, you can compute the global average: compute the global sum and the total object count, then divide (don't divide by the number of nodes, but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
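Here is a small R sketch of that two-pass recipe, using a list of numeric vectors to stand in for the per-node partitions (the data layout is my own toy setup; on a real cluster each pass would be a distributed job):

# Toy "cluster": each list element is one node's partition of the data.
partitions <- split(rnorm(1e5, mean = 50, sd = 3), rep(1:4, length.out = 1e5))

# Pass 1: per-partition sums and counts give the global mean.
total_sum   <- sum(sapply(partitions, sum))
total_count <- sum(sapply(partitions, length))
global_mean <- total_sum / total_count

# Pass 2: per-partition sums of squared deviations from the *global* mean.
ss <- sum(sapply(partitions, function(p) sum((p - global_mean)^2)))
global_var <- ss / (total_count - 1)   # sample variance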
Do not use E[X^2] - (E[X])^2
While this formula "mean of square minus square of mean" is highly popular, it is numerically unstable when you are using floating point math. It's known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits in precision when computing the difference. I've seen people get a negative variance this way...
With "big data", numerical problems gets worse...
Two ways to avoid these problems:
Use two passes. Computing the mean first is stable, and frees you from the subtraction of the squares.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances (a sketch follows this list). Details are on Wikipedia. In my experience this is often slower, but it may be beneficial on Hadoop due to startup and IO costs.
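And a sketch of the second option: a Welford-style single pass over each partition, followed by a pairwise merge of the (count, mean, M2) triples. The merge step is the standard parallel variant described on the Wikipedia page; the function names are mine.

# One pass over a partition: count n, running mean, and M2 = sum of squared
# deviations from the running mean (Welford's update).
welford <- function(x) {
  n <- 0; m <- 0; M2 <- 0
  for (xi in x) {
    n     <- n + 1
    delta <- xi - m
    m     <- m + delta / n
    M2    <- M2 + delta * (xi - m)
  }
  list(n = n, mean = m, M2 = M2)
}

# Merge two partial results (the weighted combination step).
combine <- function(a, b) {
  n     <- a$n + b$n
  delta <- b$mean - a$mean
  list(n = n,
       mean = a$mean + delta * b$n / n,
       M2   = a$M2 + b$M2 + delta^2 * a$n * b$n / n)
}

parts  <- lapply(partitions, welford)   # the partitions from the sketch above
merged <- Reduce(combine, parts)
merged$M2 / (merged$n - 1)              # global sample variance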
You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X^2] - E[X]^2 and cancellation...
To figure out how important cancellation error is when calculating the standard deviation with
σ = √(E[X^2] - E[X]^2)
let us assume that we have both E[X^2] and E[X]^2 accurate to 12 significant decimal figures. This implies that σ^2 has an error of order 10^-12 × E[X^2] or, if there has been significant cancellation, equivalently 10^-12 × E[X]^2, in which case σ will have an error of approximate order 10^-6 × E[X]: one millionth of the mean.
For many, if not most, statistical analyses this is negligible, in the sense that it falls within other sources of error (like measurement error), and so you can in good conscience simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement) then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celsius!
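A quick R illustration of how bad the cancellation can get when the mean dwarfs the spread (the offset of 1e8 is an arbitrary choice to force the effect):

set.seed(1)
x <- rnorm(1e5, mean = 1e8, sd = 1)   # huge mean, tiny spread

var(x)                  # two-pass estimate: close to 1
mean(x^2) - mean(x)^2   # E[X^2] - E[X]^2: badly off, and can even go negative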
Here is what I want to do:
I have a time series data frame with let us say 100 time-series of length 600 - each in one column of the data frame.
I want to pick 4 of the time series at random and then assign them random weights that sum to one (e.g. 0.1, 0.5, 0.3, 0.1). Using those, I want to compute the mean of the weighted sum of the 4 time series (i.e. a convex combination).
I want to do this let us say 100k times and store each result in the form
ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean
so that I get a 9*100k df.
I tried some things already, but R is very bad with loops and I know vector-oriented
solutions are better because of R's design.
Here is what I did, and I know it is horrible:
The df is in the form
v1,v2,v3,.....,v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
results=NULL
for (x in 1:100000)
{
s=sample(1:100,4)                                #pick 4 variables randomly
a=sample(seq(0,1,0.01),1)
b=sample(seq(0,1-a,0.01),1)
c=sample(seq(0,(1-a-b),0.01),1)
d=1-a-b-c
w=c(a,b,c,d)                                     #4 random weights summing to 1
average=mean(as.matrix(timeseries.df[,s])%*%w)   #mean of the weighted sum
results=rbind(results,c(s,w,average))            #in the end I get the 100k*9 df
}
The procedure runs way too slowly.
EDIT:
Thanks for the help I've had; I am not used to thinking in R, and I am not used to translating every problem into a matrix-algebra expression, which is what you need in R.
The problem becomes a little more complex if I want to calculate the standard deviation as well.
I need the covariance matrix, and I am not sure if/how I can pick the random elements for each sample from the covariance matrix of the original timeseries.df and then compute the sample variance
t(sampleweights)%*%sample_cov.mat%*%sampleweights
to get, in the end, the ts.weighted_standard_dev matrix.
Last question: what is the best way to proceed if I want to bootstrap the original df
x times and then apply the same computations, to test the robustness of my data?
thanks
OK, let me try to solve your problem. As a foreword: I can think of no application where it is sensible to do what you are doing. However, that is for you to judge (nonetheless, I would be interested in the application...).
First, note that the mean of the weighted sum equals the weighted sum of the means:
mean(w1*ts1 + w2*ts2 + w3*ts3 + w4*ts4) = w1*mean(ts1) + w2*mean(ts2) + w3*mean(ts3) + w4*mean(ts4)
Let's generate some sample data:
timeseries.df <- data.frame(matrix(runif(1000, 1, 10), ncol=40))
n <- 4 # number of items in the convex combination
replications <- 100 # number of replications
Thus, we may first compute the mean of all columns and do all further computations using this mean:
ts.means <- apply(timeseries.df, 2, mean)
Let's create some samples:
samples <- replicate(replications, sample(1:length(ts.means), n))
and the corresponding weights for those samples:
weights <- matrix(runif(replications*n), nrow=n)
# Now norm the weights so that each column sums up to 1:
weights <- weights / matrix(apply(weights, 2, sum), nrow=n, ncol=replications, byrow=T)
That part was a little bit tricky. Run the individual functions on their own with a small number of replications to figure out what they are doing. Note that I took a different approach for generating the weights: first get uniformly distributed values and then norm them by their sum. The result should be identical to your approach, but with arbitrary resolution and much better performance.
Again a little bit tricky: get the means of each time series and multiply them by the weights just computed:
ts.weightedmeans <- matrix(ts.means[samples], nrow=n) * weights
# and sum them up:
weights.sum <- apply(ts.weightedmeans, 2, sum)
Now we are basically done: all the information is available and ready to use. The rest is just a matter of correctly formatting the data.frame.
result <- data.frame(t(matrix(names(ts.means)[samples], nrow=n)), t(weights), weights.sum)
# For perfectness, use better names:
colnames(result) <- c(paste("Sample", 1:n, sep=''), paste("Weight", 1:n, sep=''), "WeightedMean")
I would assume this approach to be rather fast - on my system the code took 1.25 seconds with the amount of repetitions you stated.
Final word: you were in luck that I was looking for something that kept me thinking for a while. Your question was not asked in a way that encourages users to think about your problem and give good answers. The next time you have a problem, I would suggest you read www.whathaveyoutried.com first and break the problem down as far as you are able. The more concrete your problem, the faster and higher quality your answers will be.
Edit
You mentioned correctly that the weights generated above are not uniformly distributed over the whole range of values. (I still have to object that even (0.9, 0.05, 0.025, 0.025) is possible, but it is very unlikely).
Now we are playing in a different league, though. I am pretty sure that the approach you took is not uniformly distributed either: the probability of the last value being 0.9 is far less than the probability of the first one being that large. Honestly, I do not have a good idea ready for you concerning the generation of random vectors distributed uniformly on the unit sphere according to the L_1 distance. (Actually, it is not really a unit sphere, but the two problems should be equivalent.)
Thus, I have to give up on this.
I would suggest raising a new question at stats.stackexchange.com concerning the generation of those random vectors. It is probably fairly simple using the correct technique. However, I doubt that this question, with this heading and an already fairly long answer, will attract a potential responder... (If you do ask the question over there, I would appreciate a link, as I would like to know the solution.)
Concerning the variance: I do not fully understand which standard deviation you want to compute. If you just want to compute the standard deviation of each time series, why do you not use the built-in function sd? In the computation above you could just replace mean by it.
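If what you are after is instead the standard deviation of each weighted combination (which is what the t(sampleweights) %*% sample_cov.mat %*% sampleweights expression in your edit suggests), here is a hedged sketch reusing the samples, weights and result objects from above; whether cov() on the selected columns is the covariance matrix you have in mind is an assumption on my part:

# Standard deviation of each weighted combination of the sampled series.
combo.sd <- sapply(1:replications, function(i) {
  idx <- samples[, i]                 # columns picked in this replication
  w   <- weights[, i]                 # their weights (sum to 1)
  S   <- cov(timeseries.df[, idx])    # sample covariance of those series
  sqrt(as.numeric(t(w) %*% S %*% w))  # sd of the weighted combination
})
result$WeightedSD <- combo.sd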
Bootstrapping: That is a whole new question. Separate different topics by starting new questions.
Consider a vector V riddled with noisy elements. What would be the fastest (or any) way to find a reasonable maximum element?
For e.g.,
V = [1 2 3 4 100 1000]
rmax = 4;
I was thinking of sorting the elements and finding the second differential {i.e. diff(diff(unique(V)))}.
EDIT: Sorry about the delay.
I can't post any representative data since it contains 6.15e5 elements. But here's a plot of the sorted elements.
By just looking at the plot, a piecewise linear function may work.
Anyway, regarding my previous conjecture about using differentials, here's a plot of diff(sort(V));
I hope it's clearer now.
EDIT: Just to be clear, the desired "maximum" value would be the value right before the step in the plot of the sorted elements.
NEW ANSWER:
Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:
diffThreshold = 2e5;
sortedVector = sort(V);
index = find(diff(sortedVector) > diffThreshold,1,'first');
signalThreshold = sortedVector(index);
Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:
binEdges = min(V):1e7:max(V); % Create vector of bin edges
n = histc(V,binEdges); % Bin amplitude data
binThreshold = 100; % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first'); % Find first bin whose count is low
signalThreshold = binEdges(index);
OLD ANSWER (for posterity):
Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:
nDevs = 2; % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V); % Index of "reasonable" values
maxValue = max(V(index)); % Maximum of "reasonable" values
I would not sort then difference. If you have some reason to expect continuity or bounded change (the vector is of consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare the spike to a larger neighborhood (2nd difference effectively has you looking within a window of +-2).
You need to describe formally the expected information in the vector, and the type of noise.
You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all or none (you randomly choose to store the true value, or an error). You should be able to figure out for each element the chance that it's accurate, vs. the chance that it's noise. This could be very easy (error data values are always in a certain range which doesn't overlap with non-error values), or very hard.
To simplify: don't make any assumptions about what kind of data an error produces (the worst case is that you can't rule out any of the error data points as ridiculous, but they are all at or above the maximum among the non-error measurements). Then, if the probability of error is p and your vector has n elements, the chance that the kth highest element in the vector is less than or equal to the true maximum is given by the cumulative binomial distribution - http://en.wikipedia.org/wiki/Binomial_distribution
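For concreteness, a small numeric check of that claim (written in R here simply because the binomial CDF is a one-liner; MATLAB's binocdf in the Statistics Toolbox computes the same quantity). Under the worst-case model above, the kth highest value is at most the true maximum whenever at most k-1 elements are errors, so the probability is at least the binomial CDF at k-1; the p and k values below are purely hypothetical:

p <- 0.01    # assumed per-element error probability (hypothetical)
n <- 6.15e5  # number of elements, as in the question
k <- 10000   # take the 10000th highest value, say
pbinom(k - 1, size = n, prob = p)   # lower bound on the chance this value is <= the true maximum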
First, pick your favorite method for identifying outliers...
If you expect the numbers to come from a normal distribution, you can use, say, 2 standard deviations above the mean to determine your max.
Do you have access to the bounds of your noise-free elements? For example, do you know that your noise-free elements are between -10 and 10?
In that case, you could remove the noise and then find the max:
max( v( find(v<=10 & v>=-10) ) )