3 standard deviations of the mean - math

I have a data set. It's biological material. I have put in my standard deviations and I can see that all of my data bar 2 data points are within 3sd of the mean.
Is it accepted that data points that fall within 3sd of the mean are within normal variation?
Or does that depend on the range and dispersion of the data? I'm not a mathematician, just somebody trying to work out whether I have a process in control. I have always understood 3 sd to represent 95% of the data, and therefore that data inside this band is within normal variation and not worth investigating. However, I am often asked to investigate data that is well within 2 sd based on how the chart looks!
When should one be investigating data as abnormal when using standard deviations?
Many thanks in advance for any help

You should take a look at the 68–95–99.7 rule.
About 95% (95.45%) of your data will fall within two standard deviations of the mean if your data follows a normal distribution. If the data follows another distribution, Chebyshev's inequality still guarantees that at least 75% of the data will fall within two standard deviations. Assuming a normal distribution, about 99.7% (99.73%) of the data will fall within three standard deviations of the mean; for an arbitrary distribution, the guarantee is at least 88.9% (8/9).
Note that even if your data follows a normal distribution, chance (sampling error) means those percentages will not hold exactly in any finite sample.
So the numbers do depend on your data, in particular on the shape of its distribution and on the number of data points. Even under a perfectly normal distribution, with 1000 data points you should still expect about 3 of them to fall outside 3 standard deviations.
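To see both rules on concrete numbers, you can simulate, say, 1000 points from a normal distribution and count how many fall within k standard deviations. A minimal Python sketch; the simulated data and its parameters are made up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=5.0, size=1000)   # simulated "in control" measurements

mean, sd = x.mean(), x.std(ddof=1)
for k in (1, 2, 3):
    within = np.mean(np.abs(x - mean) <= k * sd)
    chebyshev = 1 - 1 / k**2   # distribution-free lower bound (vacuous for k = 1)
    print(f"within {k} sd: {within:.1%}  "
          f"(normal rule ~{[68.3, 95.4, 99.7][k - 1]}%, Chebyshev >= {chebyshev:.1%})")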

Related

How to apply principal component analysis to standardised multicentric data?

I have a question about principal component analysis.
I am working with a dataset with 2 cohorts from 2 different centres. From each centre I have a control group and 2 patient subgroups (drug-resistant and drug-responsive). My objective is to analyse the neurocognitive test data collected from all subjects during the study. The problem is that the cognitive tests applied differ slightly across centres. I therefore standardized the raw scores in each patient subgroup relative to the control group of their respective centre. Still, I am left with a big dataset of z-scores and would like to further reduce dimensionality with PCA.
My question is: does it make sense to apply PCA after standardising the data this way? (I'm not sure I can call them z-scores, as I standardised them relative to the mean and standard deviation of the control group of their respective centre and not of the entire sample!) The means of the columns will therefore not be 0. Would it still be legitimate to apply PCA? And do you think I should scale the variables again?
Any suggestions or comments are much appreciated!
Best wishes,
Bernardo
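No answer is recorded here, but for concreteness, here is a minimal Python sketch of the pipeline the question describes: standardize each subject against the mean and standard deviation of the control group of the same centre, then run PCA on the result. The file name, column names, and number of components are assumptions, not part of the question.

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical layout: one row per subject, cognitive test scores in columns
# named test_*, plus 'centre' and 'group' ('control', 'resistant', 'responsive').
df = pd.read_csv("neurocognitive_scores.csv")   # hypothetical file
tests = [c for c in df.columns if c.startswith("test_")]

def centre_zscores(sub):
    # Standardize against the control group of this centre only.
    ctrl = sub.loc[sub["group"] == "control", tests]
    return (sub[tests] - ctrl.mean()) / ctrl.std(ddof=1)

z = df.groupby("centre", group_keys=False).apply(centre_zscores)

# scikit-learn's PCA mean-centres each column itself, so the fact that these
# scores are not centred on the whole sample is not a formal obstacle.
pca = PCA(n_components=3)   # number of components is an arbitrary choice
scores = pca.fit_transform(z)
print(pca.explained_variance_ratio_)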

How to compare two indicators with their mean value and standard deviation?

I was reading about the indicators reported in some papers and found that the indicators are compared based on their mean value and standard deviation: + means better, - means worse, and ~ means they are close to each other.
A smaller value is wanted for the indicators mentioned above. As a result, it is clear that if both the mean value and the standard deviation are smaller than the other's, that indicator is better.
However, I wonder: if the mean values of two results are close to each other but their standard deviations differ, how can one decide which is better, or whether the two indicators are close to each other? For example, for 0.0550±0.0024 and 0.0422±0.0010 the result is ~ according to the paper. Another example is 4.1128±3.4048 and 2.6551±1.1333, where the result of the comparison is also ~. However, I have no clue how to fully understand how these two indicators are compared.

Is there a numerical method for approaching the first derivative at t = 0 s in a real-time application?

I want to improve, step by step as unevenly sampled data arrive, my estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity in a projectile's motion, but you do not know its final position and velocity; however, you are (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progresive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But that used preprocessed data (i.e., jitter removal, smoothing, and fitting with the theoretical functional form). It gave me the following result:
[Figure: a real example of the problem and its "current solution".]
It is good for me, but I would like to know if there is an optimal manner of doing that, but employing the raw data (or smoothed data).
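For reference, here is a minimal Python sketch of the progressive weighting idea above, with a plain straight-line fit standing in for the real model and made-up measurements; the weights mirror W = diag((0.5).^(1:kk)), so earlier samples dominate the fit.

import numpy as np

# Hypothetical measurements received so far (time in minutes, uneven spacing).
t = np.array([0.0, 14.0, 31.0, 44.0, 61.0])
y = np.array([10.0, 8.9, 7.9, 7.3, 6.7])
kk = len(t)

# Weights 0.5**k for k = 1..kk: later samples count exponentially less.
w = 0.5 ** np.arange(1, kk + 1)

# Weighted least-squares fit of y ~ a*t + b: scale the design matrix and the
# observations by sqrt(w) and solve the ordinary least-squares problem.
A = np.column_stack([t, np.ones_like(t)])
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
a, b = coef
print("current estimate of the slope at t = 0:", a)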
IMO, additional data are not that relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and last samples decreases.
Also, the asymptotic behaviour of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias.
I would stick to the first points (say up to t = 20) and fit a simple model, say a quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
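A minimal Python sketch of that suggestion, with made-up samples and an arbitrary cutoff of t = 20: fit a quadratic to the early points by least squares and read the initial slope off the linear coefficient.

import numpy as np

# Hypothetical unevenly sampled measurements.
t = np.array([0.0, 2.3, 5.1, 8.0, 12.4, 15.2, 19.7, 24.9, 31.0])
y = np.array([10.0, 9.1, 8.2, 7.5, 6.6, 6.2, 5.7, 5.3, 5.0])

early = t <= 20.0                                  # keep only the early part of the record
c2, c1, c0 = np.polyfit(t[early], y[early], 2)     # y ~ c2*t**2 + c1*t + c0

# dy/dt at t = 0 is simply the linear coefficient.
print("estimated initial slope:", c1)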

Correct way to standardize/scale/normalize multiple variables following power law distribution for use in linear combination

I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract the mean and divide by the standard deviation). This seems like it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling the variables to the range [0,1] by subtracting min(variable) and dividing by (max(variable) - min(variable)). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
Any other ideas?
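For concreteness, a minimal Python sketch of the three candidate rescalings listed above, applied to made-up heavy-tailed arrays (the data and scales are hypothetical):

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical node metrics: heavy-tailed and on very different scales.
in_degree = rng.pareto(2.0, size=200) * 3
betweenness_centrality = rng.pareto(1.5, size=200) * 5000

def zscore(x):
    return (x - x.mean()) / x.std(ddof=1)            # option 1: standardize

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())       # option 2: rescale to [0, 1]

def mean_scale(x):
    return x / x.mean()                              # option 3: equalize the means

# e.g. combining the min-max versions:
informal_power_index = minmax(in_degree) + minmax(betweenness_centrality)
print(informal_power_index[:5])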
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability (its value under the cumulative distribution). Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each value to a 10-point scale based on whether it falls in the 0-10% percentile band, the 10-20% band, ..., or the 90-100% band. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
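A minimal Python sketch of that percentile/decile idea, assuming the same kind of raw arrays as above (names and data are hypothetical):

import numpy as np
from scipy.stats import rankdata

def decile_score(x):
    # Empirical percentile of each value (rank / n), binned into 1..10.
    pct = rankdata(x) / len(x)
    return np.ceil(pct * 10).astype(int)

# Both metrics now live on (approximately) the same uniform 1..10 scale,
# whatever their original distributions, so they can be combined directly:
# informal_power_index = decile_score(in_degree) + decile_score(betweenness_centrality)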
You could translate each to a percentage and then apply each to a known quantity, then use the sum of the new values:
((1 - in_degree / 15) * 2000) + ((1 - betweenness_centrality / 35000) * 2000) = ?
Very interesting question. Could something like this work:
Let's assume that we want to scale both variables to the range [-1, 1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable; as an example, let's choose 25,000.
Create 25,000 bins over the original range [0, 35000] and 25,000 bins over the new range [-1, 1].
For each number x_i, find the bin it falls into in the original range; call this B_i.
Find the range covered by B_i in [-1, 1].
Use either the max or the min of the range of B_i in [-1, 1] as the scaled version of x_i.
This preserves the power-law distribution while scaling it down to [-1, 1], and it does not have the problem experienced by (x - mean)/sd.
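As far as I can tell, with equal-width bins this procedure amounts to a (quantized) linear map of the original range onto [-1, 1]; a rough Python sketch, with the bin count and input range taken from the example above:

import numpy as np

def bin_rescale(x, lo, hi, n_bins=25000):
    # Assign each value to one of n_bins equal-width bins on [lo, hi] ...
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    # ... and report the upper edge of the matching bin in [-1, 1].
    new_edges = np.linspace(-1.0, 1.0, n_bins + 1)
    return new_edges[idx + 1]

# e.g. bin_rescale(betweenness_centrality, 0, 35000)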
Normalizing to [0, 1] would be my short-answer recommendation for combining the 2 values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.
If the distributions of the 2 variables are different, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done many ways; one would be to determine how many standard deviations away from the mean the given value is, and you could then combine these 2 values in some way to get your index (addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.

Statistical Analysis of Server Logs - Correctness of Extrapolation

We had an ISP failure for about 10 minutes one day, which unfortunately occurred during a hosted exam that was being written from multiple locations.
Unfortunately, this resulted in the loss of postback data for candidates' current page in progress.
I can reconstruct the flow of events from the server log. However, of 317 candidates, 175 were using a local proxy, which means they all appear to come from the same IP. I've analyzed the data from the remaining 142 (45%), and come up with some good numbers as to what happened with them.
Question: How correct is it to multiply all my numbers by 317/142 to achieve probable results for the entire set? What would be my region of (un)certainty?
Please, no guesses. I need someone who didn't fall asleep in stats class to answer.
EDIT: By numbers, I was referring to counts of affected individuals. For example, 5/142 showed evidence of a browser crash during the session. How correct is the extrapolation that 11/317 had browser crashes?
I'm not sure exactly what measurements we are talking about, but for now let's assume that you want something like the average score. No adjustment is necessary for estimating the mean score of the population (the 317 candidates). Just use the mean of the sample (the 142 whose data you analyzed).
To find your region of uncertainty you can use the formula given in the NIST statistics handbook. You must first decide how uncertain you are willing to be. Let's assume that you want 95% confidence that the true population mean lies within the interval. Then, the confidence interval for the true population mean will be:
(sample mean) +/- 1.960*(sample standard deviation)/sqrt(sample size)
There are further corrections you can make to take credit for having a large sample relative to the population. They will tighten the confidence interval by about 1/4, but there are plenty of assumptions that the above calculation makes that already make it less conservative. One assumption is that the scores are approximately normally distributed. The other assumption is that the sample is representative of the population. You mentioned that the missing data are all from candidates using the same proxy. The subset of the population that used that proxy could be very different from the rest.
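Plugging hypothetical numbers into the formula above (the actual scores are not given in the question), the calculation is just:

import math

# Hypothetical sample statistics for the 142 analyzed candidates.
sample_mean, sample_sd, n = 72.4, 11.3, 142

half_width = 1.960 * sample_sd / math.sqrt(n)
print(sample_mean - half_width, sample_mean + half_width)   # 95% CI for the population mean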
EDIT: Since we are talking about a proportion of the sample with an attribute, e.g. "browser crashed", things are a little different. We need to use a confidence interval for a proportion, and convert it back to a number of successes by multiplying by the population size. This means that our best-guess estimate of the number of crashed browsers is 5*317/142 ~= 11 as you suggested.
If we once again ignore the fact that our sample is nearly half of the population, we can use the Wilson confidence interval of a proportion. A calculator is available online to handle the formula for you. The output from the calculator and the formula is upper and lower limits for the fraction in the population. To get a range for the number of crashes, just multiply the upper and lower limits by (population size - sample size) and add back the number of crashes in the sample. While we could simply multiply by the population size to get the interval, that would ignore what we already know about our sample.
Using the procedure above gives a 95% C.I. of 7.6 to 19.0 for the total number of browser crashes in the population of 317, based on 5 crashes in the 142 sample points.
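A minimal Python sketch of that calculation (Wilson interval for 5 out of 142, scaled up to the unobserved part of the 317 as described above); the Wilson formula is written out directly rather than taken from the online calculator:

import math

def wilson_ci(successes, n, z=1.960):
    # Wilson score interval for a binomial proportion, 95% by default.
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

crashes, sample, population = 5, 142, 317
lo, hi = wilson_ci(crashes, sample)

# Scale the interval up to the candidates we did not observe and add back
# the crashes we actually saw, as described above.
unseen = population - sample
print(crashes + lo * unseen, crashes + hi * unseen)   # roughly 7.6 to 19.0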
