Statistical Analysis of Server Logs - Correctness of Extrapolation

We had an ISP failure for about 10 minutes one day, which unfortunately occurred during a hosted exam that was being written from multiple locations.
Unfortunately, this resulted in the loss of postback data for candidates' current page in progress.
I can reconstruct the flow of events from the server log. However, of 317 candidates, 175 were using a local proxy, which means they all appear to come from the same IP. I've analyzed the data from the remaining 142 (45%), and come up with some good numbers as to what happened with them.
Question: How correct is it to multiply all my numbers by 317/142 to achieve probable results for the entire set? What would be my region of (un)certainty?
Please, no guesses. I need someone who didn't fall asleep in stats class to answer.
EDIT: By "numbers" I was referring to counts of affected individuals. For example, 5/142 showed evidence of a browser crash during the session. How correct is the extrapolation to 11/317 having browser crashes?

I'm not sure exactly what measurements we are talking about, but for now let's assume that you want something like the average score. No adjustment is necessary for estimating the mean score of the population (the 317 candidates). Just use the mean of the sample (the 142 whose data you analyzed).
To find your region of uncertainty you can use the formula given in the NIST statistics handbook. You must first decide how uncertain you are willing to be. Let's assume that you want 95% confidence that the true population mean lies within the interval. Then, the confidence interval for the true population mean will be:
(sample mean) +/- 1.960*(sample standard deviation)/sqrt(sample size)
There are further corrections you can make to take credit for having a large sample relative to the population (the finite population correction). They will tighten the confidence interval by about 1/4, but the calculation above already makes assumptions that make it less conservative. One assumption is that the scores are approximately normally distributed. The other is that the sample is representative of the population. You mentioned that the missing data are all from candidates using the same proxy; the subset of the population that used that proxy could be very different from the rest.
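For what it's worth, that interval is quick to compute; here is a minimal Python sketch, with simulated placeholder scores standing in for the real exam data:

import numpy as np

# Placeholder scores for the 142 analyzed candidates (simulated, not the real data).
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=12, size=142)

n = scores.size
mean = scores.mean()
sd = scores.std(ddof=1)                  # sample standard deviation
half_width = 1.960 * sd / np.sqrt(n)
print(f"95% CI for the population mean: {mean - half_width:.1f} to {mean + half_width:.1f}")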
EDIT: Since we are talking about a proportion of the sample with an attribute, e.g. "browser crashed", things are a little different. We need to use a confidence interval for a proportion, and convert it back to a number of successes by multiplying by the population size. This means that our best-guess estimate of the number of crashed browsers is 5*317/142 ~= 11 as you suggested.
If we once again ignore the fact that our sample is nearly half of the population, we can use the Wilson confidence interval of a proportion. A calculator is available online to handle the formula for you. The output from the calculator and the formula is upper and lower limits for the fraction in the population. To get a range for the number of crashes, just multiply the upper and lower limits by (population size - sample size) and add back the number of crashes in the sample. While we could simply multiply by the population size to get the interval, that would ignore what we already know about our sample.
Using the procedure above gives a 95% C.I. of 7.6 to 19.0 for the total number of browser crashes in the population of 317, based on 5 crashes in the 142 sample points.
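Here is a minimal Python sketch of that whole calculation, with the Wilson formula written out so no special library is needed; it reproduces the 7.6 to 19.0 interval:

import math

def wilson_interval(k, n, z=1.96):
    # Wilson score interval for a proportion, k successes in n trials.
    p = k / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - spread) / denom, (centre + spread) / denom

k, n, N = 5, 142, 317        # crashes seen, sample size, population size
lo, hi = wilson_interval(k, n)

# Apply the interval only to the unobserved part of the population,
# then add back the crashes actually observed in the sample.
print(lo * (N - n) + k, hi * (N - n) + k)    # about 7.6 and 19.0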

Related

COMSOL: Diffusion in Transport of Diluted Species Produces Unphysical Results

I am simulating Transport of Diluted Species inside a pipe segment in COMSOL Multiphysics. I have specified an initial concentration which produces a concentration distribution around a slice through the pipe at t=0. Moreover, I have a point probe a little bit upstream (I am using laminar flow for convection). I am plotting the concentration at this point dependent on time.
To investigate whether the model produces accurate (i.e. physically realistic) results, I am varying the diffusion coefficient D. This is where I noticed unrealistic behavior: for a large range of diffusion coefficients, the concentration graph at the point probe does not change. This is unphysical, since e.g. higher diffusion coefficients should lead to a more spread-out distribution at the point probe.
I already did a mesh refinement study and found that the result strongly depends on mesh resolution. Therefore, I am now using the highest mesh resolution (extremely fine). Regardless, the concentration results still do not change for varying diffusion coefficients.
What could be the reason for this unphysical behavior? I already know it is not due to mesh resolution or relative tolerance of the solver.
After a lot of time spent on this simulation, I concluded that the undesired effects are indeed due to numerical diffusion, as suggested by 2b-t. Of course, it is impossible to be certain that this is actually the reason. However, I investigated pretty much every other potential culprit in the simulation - without any new insights.
To work around this issue of numerical diffusion, I switched to Particle-Based Simulation (PBS) and approximated the concentration as the normalized number of particles inside a small receiver volume. This method provides a good approximation for the concentration for large particle numbers and a small receiver volume.
By doing this, I produced results that are in very good agreement with results known from the literature.
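The post-processing step itself is simple; here is a rough Python sketch of how the concentration can be estimated from particle positions (all geometry and numbers below are made-up placeholders, not my actual model):

import numpy as np

# Hypothetical particle positions at one output time, shape (n_particles, 3), in metres.
rng = np.random.default_rng(1)
positions = rng.uniform(0.0, 1e-2, size=(200_000, 3))

probe_centre = np.array([5e-3, 5e-3, 5e-3])   # placeholder probe location
r = 2e-4                                      # receiver (probe) radius

inside = np.linalg.norm(positions - probe_centre, axis=1) < r
receiver_volume = 4.0 / 3.0 * np.pi * r**3

# Particle count in the receiver, normalized by the total number released and the
# receiver volume; scale by the released amount of species to get a concentration.
print(inside.sum() / positions.shape[0] / receiver_volume)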

Is there a numerical method for approaching the first derivative at t = 0 s in a real-time application?

I want to improve, step by step as unevenly sampled data arrive, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity in a projectile's motion, but you do not know its final position and velocity; you are only (slowly) receiving measurements of the projectile's current position and the corresponding time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But it was using preprocessed data (i.e., jitter removal, smoothing, and fitting with the theoretical functional). It gave me the following result:
This is a real example of the problem and its "current solution"
This works for me, but I would like to know whether there is an optimal way of doing this using the raw data (or smoothed data).
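For reference, this is roughly what one update of that progressive WLS step looks like in Python; the quadratic below is only a placeholder for the theoretical functional, and the 0.5 decay matches the W matrix above:

import numpy as np

def initial_slope_wls(t, y, decay=0.5, degree=2):
    # Weighted least-squares polynomial fit with geometrically decaying weights
    # (w_k = decay**k, as in the W matrix above), so early samples dominate.
    # Returns the fitted first derivative at t = 0 (the coefficient of t).
    t, y = np.asarray(t, float), np.asarray(y, float)
    w = decay ** np.arange(1, t.size + 1)
    A = np.vander(t, degree + 1)                        # columns: t^degree, ..., t, 1
    sw = np.sqrt(w)
    coeffs, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coeffs[-2]

# Re-run the fit each time a new (t, y) measurement arrives.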
IMO, additional data are not that relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and last samples decreases.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias in the estimate.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
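A minimal Python sketch of the "early points plus quadratic" suggestion (the measurement arrays are placeholders, and t = 20 is the cutoff mentioned above):

import numpy as np

# Placeholder measurements: irregular sample times and readings.
t = np.array([0.0, 3.1, 6.8, 11.5, 17.9, 26.4, 40.2])
y = np.array([20.0, 21.4, 22.9, 24.4, 26.1, 27.8, 29.9])

# Keep only the early points and fit a simple quadratic to them.
mask = t <= 20.0
a, b, c = np.polyfit(t[mask], y[mask], deg=2)   # fits a*t**2 + b*t + c

# The first derivative at t = 0 is the linear coefficient b.
print(b)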

Testing CSR on lpp with R

I have recently posted a "very newbie to R" question about the correct way of doing this; if you are interested, you can find it here.
I have now managed to develop a simple R script that does the job, but now the results are what troubles me.
Long story short, I'm using R to analyze lpp (Linear Point Pattern) objects with mad.test. That function performs a hypothesis test where the null hypothesis is that the points are randomly distributed. Currently I have 88 lpps to analyze, and according to the p.value, 86 of them are randomly distributed and 2 of them are not.
These are the two not randomly distributed lpps.
Looking at them you can see some kind of clustering in the first one, but the second one only has three points, and it seems to me that there is no way one can claim that just three points are inconsistent with a random distribution. There are other tracks with one, two, or three points, but they all fall into the "random" lpp category, so I don't know why this one is different.
So here is the question: how many points are too few for CSR testing?
I have also noticed that these two lpps have a much lower $statistic$rank than the others. I have tried to find out what that means but I'm clueless right now, so here is another newbie question: is the $statistic$rank some kind of quality indicator, and can I therefore use it to group my lpp analyses into "significant" ones and "too few points" ones?
My R script and all the shp files can be downloaded from here (850 KB).
Thank you so much for your help.
It is impossible to give a universal answer to the question of how many points are needed for an analysis. Usually 0, 1, and 2 are too few for a standalone analysis. However, if they are part of repeated measurements of the same thing they might still be interesting. I would also normally say that your example with 3 points is too few to say anything interesting. However, an extreme example would be a single long line segment where one point occurs close to one end and the two others occur close to each other at the other end. This is not very likely to happen under CSR, and you may be inclined not to believe that hypothesis. This appears to be what happened in your case.
Regarding your question about the rank, you might want to read up a bit more on the Monte Carlo test you are performing. Basically, you summarise the point pattern by a single number (the maximum absolute deviation of the linear K function) and then look at how extreme this number is compared with numbers generated at random from CSR. Assuming you use 99 simulations of CSR, you have 100 numbers in total. If your data ranks as the most extreme ($statistic$rank==1) among these, it has p-value 1%. If it ranks as the 50th number, the p-value is 50%. If you used another number of simulations you have to calculate accordingly, i.e. with 199 simulations rank 1 is 0.5%, rank 2 is 1%, etc.
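In code, the conversion is just the following (a Python sketch, assuming the observed pattern is counted among the n_sim + 1 values):

def mc_p_value(rank, n_sim):
    # Rank of the observed statistic among (n_sim simulated + 1 observed) values,
    # with rank 1 = most extreme.
    return rank / (n_sim + 1)

print(mc_p_value(1, 99))     # 0.01  -> 1%
print(mc_p_value(50, 99))    # 0.50  -> 50%
print(mc_p_value(1, 199))    # 0.005 -> 0.5%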
There is a fundamental problem here with multiple testing. You are applying a hypothesis test 88 times. The test is (by default) designed to give a false positive in 5 percent (1 in 20) of applications, so if the null hypothesis is true you should expect 88/20 = 4.4 false positives to have occurred in your 88 tests. So getting only 2 positive results ("non-random") is entirely consistent with the null hypothesis that ALL of the patterns are random. My conclusion is that the patterns are random.
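To put a rough number on that, here is a quick Python check, treating the 88 tests as independent (an approximation):

from math import comb

n_tests, alpha = 88, 0.05
print(n_tests * alpha)    # expected false positives under the null: 4.4

# Probability of seeing 2 or fewer rejections if all 88 patterns really are CSR.
p_le_2 = sum(comb(n_tests, k) * alpha**k * (1 - alpha)**(n_tests - k) for k in range(3))
print(round(p_le_2, 2))   # about 0.18, so 2 "non-random" results is unremarkable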

3 standard deviations of the mean

I have a data set. It's biological material. I have put in my standard deviations and I can see that all of my data bar 2 data points are within 3sd of the mean.
Is it accepted that data points that fall within 3sd of the mean are within normal variation?
Or does that depend on the range and dispersion of the data? I'm not a mathematician, just somebody trying to work out whether I have a process in control. I have always understood 3 sd to represent 95% of the data, and therefore that data inside this range are within normal variation and not worth investigating. However, I am often asked to investigate data that are well within 2 sd based on how the chart looks!
When should one be investigating data as abnormal when using standard deviations?
Many thanks in advance for any help
You should take a look at the 68–95–99.7 rule.
About 95% (95.45%) of your data will fall within two standard deviations of the mean if your data follow a normal distribution. If the data follow another distribution, Chebyshev's inequality says that at least 75% of the data will necessarily fall within two standard deviations. Assuming a normal distribution, about 99.7% (99.73%) of the data will fall within three standard deviations of the mean; for an arbitrary distribution, at least 89% (8/9 ≈ 88.9%) will fall there.
Note that even if your data follows a normal distribution, chance (sampling error) will make it so that those percentages are not exactly the case.
So the numbers do depend on your data, especially the kind of distribution and the number of data points. Even with 1000 data points from a normal distribution you should still expect about 3 points (0.27% of 1000 ≈ 2.7) outside 3 standard deviations.
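These figures are easy to check numerically; a small Python sketch:

from math import erf, sqrt

def normal_within(k):
    # Probability that a normal observation lies within k standard deviations of the mean.
    return erf(k / sqrt(2))

for k in (2, 3):
    print(k, round(normal_within(k), 4), "Chebyshev lower bound:", round(1 - 1 / k**2, 4))

# Expected number of points beyond 3 sd in 1000 normal observations:
print(1000 * (1 - normal_within(3)))    # about 2.7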

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements coming in every 10 minutes; some of them are periodic (e.g. bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average to remove some noise, but then what next? Simple things like standard deviation, MAD, ... against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest in this and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise cancelling headphones, etc. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short term model of the signal source, thereby reducing mean square error output. This then gives you a low level output signal (the residual error) except for when you get an outlier, which will result in a spike, which will be easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc, if you're serious about this kind of technique.
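If you want to experiment with that route, here is a rough Python sketch of a one-step-ahead normalized LMS predictor whose residual is turned into an anomaly score; the filter order, step size, and window are arbitrary starting points, not tuned values:

import numpy as np

def nlms_outlier_scores(x, order=8, mu=0.5, window=50):
    # One-step-ahead normalized LMS predictor. While the filter tracks the signal
    # the residual stays small; an outlier produces a spike, scored against the
    # recent residual scale (median absolute residual).
    x = np.asarray(x, dtype=float)
    w = np.zeros(order)
    scores = np.zeros(x.size)
    residuals = []
    for n in range(order, x.size):
        past = x[n - order:n][::-1]                      # most recent sample first
        err = x[n] - w @ past                            # prediction error (residual)
        w += mu * err * past / (past @ past + 1e-12)     # NLMS weight update
        residuals.append(err)
        scale = np.median(np.abs(residuals[-window:])) + 1e-12
        scores[n] = abs(err) / scale                     # large score -> likely outlier
    return scores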
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample whose "outlierness" you want to know:
Retrieve the means, covariance matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
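A condensed Python sketch of that recipe (the initial outlier-trimming step is left out, and the scorer assumes each sample is a vector of attributes, one row per time point):

import numpy as np

class MahalanobisScorer:
    # Training: store per-attribute means/stds and the correlation matrix,
    # plus the Mahalanobis distances of the training samples.
    def fit(self, X):                                   # X: (n_samples, n_attributes)
        X = np.asarray(X, dtype=float)
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0, ddof=1)
        z = (X - self.mean) / self.std
        self.inv_corr = np.linalg.pinv(np.corrcoef(z, rowvar=False))
        self.train_d = np.array([self._distance(row) for row in X])
        return self

    def _distance(self, x):
        # Mahalanobis distance using the correlation matrix on standardized data.
        z = (np.asarray(x, dtype=float) - self.mean) / self.std
        return float(np.sqrt(z @ self.inv_corr @ z))

    def score(self, x):
        # Percentile of the new sample's distance among the training distances;
        # values near 100 indicate an extreme outlier.
        return 100.0 * np.mean(self.train_d <= self._distance(x))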
