Currently the Gatling report provides the following metrics:
minimum
maximum
50th percentile
75th percentile
95th percentile
99th percentile
mean
standard deviation
but it does not provide the arithmetic average.
I found a related discussion about this in the Google group, but I'm not sure about the outcome.
In general I agree that the average is not a good metric to use. Percentiles are way more representative - no doubt about that. So from an engineering point of view I don't see a problem with the currently supported metrics.
But a problem arises from a legal point of view (as usual :). It seems there are ill-conceived SLAs out there which explicitly define average request duration. Since Gatling does not emit the average metric, boneheaded customers won't accept the "Ready-to-present HTML reports" as proof that such SLA terms are met...
So my question is: Is there a way to add the average metric to the report output?
mean and average are exactly the same thing.
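In other words, the "mean" line in the report already is the arithmetic average of the response times; for example (made-up numbers):

```python
# Made-up response times (ms); the reported "mean" is this arithmetic average.
response_times = [100, 200, 300, 400]
print(sum(response_times) / len(response_times))   # 250.0
```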
I want to calculate the 95% coverage rate of a simple OLS estimator.
The (for me) tricky addition is that the independent variable has 91 values that I have to test against each other in order to see which value leads to the best estimate.
For each value of the independent variable I want to draw 1000 samples.
I tried reading up on the theory and searching on multiple platforms such as Stack Overflow, but I didn't manage to find an appropriate answer.
My biggest question is how to calculate a coverage rate for a 95% confidence interval.
I would deeply appreciate it if you could provide me with some possibilities or insights.
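A minimal sketch of how such a coverage simulation could look (the data-generating process, true coefficients, noise level, and sample size below are assumptions; in practice you would repeat this for each of the 91 settings of the independent variable and compare the resulting coverage rates):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
true_beta0, true_beta1 = 1.0, 2.0      # assumed true parameters of the data-generating process
n, n_reps = 100, 1000                  # sample size and replications (1000 samples per setting)

covered = 0
for _ in range(n_reps):
    x = rng.normal(size=n)                                   # assumed regressor distribution
    y = true_beta0 + true_beta1 * x + rng.normal(size=n)     # noise sd = 1 (assumed)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    ci_low, ci_high = fit.conf_int(alpha=0.05)[1]            # 95% CI for the slope
    covered += ci_low <= true_beta1 <= ci_high

print("Empirical coverage of the 95% CI:", covered / n_reps)
```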
There is a dataset which contains aggregated data - aggregated to various dimensions, and down to the hourly level. The main measure is speed, which is simply the file size divided by the duration.
The requirement is to see Percentile, Median and Average/Mean summaries.
Mean is easy because we simply create a calculated measure in the MDX and then it works at all aggregation levels, i.e. daily/monthly etc.
However, percentile and median are hard. Is there any way to have a calculation for these functions which will roll up correctly? We could add the percentile speed as a column in the ETL when we're reading the raw data, but we'd still need to find a way to roll it up further.
What is the proper way to roll up these types of measures? It's not uncommon to ask for percentile numbers, so I'm surprised not to see much information on this when I look around.
Maybe the only approach is to have various aggregated tables at the right level, with the right calculation, and then make Mondrian use them as agg tables? Or, worst case, have multiple cubes (!)
OK, so it turns out you cannot roll up percentiles (and therefore medians, which are just the 50th percentile). I understand others have had this problem; see this tweet from Kasper here: https://twitter.com/kaspersor/status/308189242788560896
So our solution was a couple of different agg tables to store the relevant stats, plus the pre-computed percentile and median stats stored on the main (already aggregated) fact table.
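For anyone wondering why the roll-up fails, here is a tiny illustration with made-up hourly values: the median of the hourly medians is not the median of the combined data, so a pre-aggregated median cannot simply be rolled up further.

```python
import numpy as np

# Made-up hourly measurements.
hour1 = np.array([1, 2, 100])
hour2 = np.array([3, 4, 5])

median_of_medians = np.median([np.median(hour1), np.median(hour2)])   # 3.0
true_daily_median = np.median(np.concatenate([hour1, hour2]))         # 3.5

print(median_of_medians, true_daily_median)
```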
I'm experimenting with some movie rating data, currently doing some hybrid item- and user-based predictions. Mathematically I'm unsure how to implement what I want; maybe the answer is just a straightforward weighted mean, but I feel like there might be some other option.
I have 4 values for now that I want to get the mean of:
Item-based prediction
User-based prediction
Global movie average for given item
Global user average for given user
As this progresses there will be other values I'll need to add to the mix, such as weighted similarity, genre weighting and I'm sure a few other things.
For now I want to focus on the data available to me as stated above, as much for understanding as anything else.
Here is my theory: to start, I want to weight the item- and user-based predictions equally, with both getting more weight than the global averages.
My feeling though, based on my very rusty maths and some basic attempts to come up with a less linear solution, is to use something like the harmonic mean, but instead of naturally tending towards the low value, have it tend towards the global average.
e.g.
predicted item-based rating: 4.5
predicted user-based rating: 2.5
global movie rating: 3.8
global user rating: 3.6
So the "centre"/global average here would be 3.7.
I may be way off base with this as my maths is quite rusty, but does anyone have any thoughts on how I could mathematically represent what I'm thinking?
OR do you have any thoughts on a different approach?
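For what it's worth, a minimal sketch of the plain weighted-mean baseline under the weighting described above (the specific weights are assumptions to be tuned against held-out ratings):

```python
# Weighted mean of the four signals from the example above.
# Weights are assumptions: item/user predictions weighted equally and
# more heavily than the two global averages.
values  = [4.5, 2.5, 3.8, 3.6]      # item pred, user pred, global movie avg, global user avg
weights = [0.35, 0.35, 0.15, 0.15]

prediction = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(round(prediction, 2))          # 3.56

# Equivalent view: shrink the mean of the two predictions (3.5) towards
# the "centre"/global average (3.7) with a shrinkage factor of 0.3.
print(round(0.7 * 3.5 + 0.3 * 3.7, 2))   # 3.56
```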
I recommend you look into the "Recommender Systems Handbook" by F. Ricci et al., 2011. It summarizes all the common approaches in recommender engines and provides all the necessary formulas.
Here is an excerpt from 4.2.3:
As the number of neighbors used in the prediction increases, the rating predicted by the regression approach will tend toward the mean rating of item i. Suppose item i has only ratings at either end of the rating range, i.e. it is either loved or hated, then the regression approach will make the safe decision that the item’s worth is average. [...] On the other hand, the classification approach will predict the rating as the most frequent one given to i. This is more risky as the item will be labeled as either “good” or “bad”.
I have a file with a sequence of event timestamps corresponding to the times at which someone visits a website:
02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:00
02.02.2010 09:00:01
02.02.2010 09:00:03
02.02.2010 09:00:05
02.02.2010 09:00:06
02.02.2010 09:00:06
02.02.2010 09:00:09
02.02.2010 09:00:11
02.02.2010 09:00:11
02.02.2010 09:00:11
etc., for several thousand rows.
I'd like to get an idea of how the web hits are distributed over time, over the week, etc. I need to know how I should scale the (future) web servers in order to guarantee service availability with a given number of nines. In particular, I need to give upper bounds on the number of almost-concurrent visits.
Are there any resources out there that explain how to do that? I'm fluent in mathematics and statistics, and I've looked at queuing theory, but it seems that theory assumes the rate of arrival to be independent of the time of day, which is clearly wrong in my case. And NO, histograms are not the right answer since the result depends heavily on bin width and placement.
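As a starting point for any of the answers below, a minimal sketch that turns the raw timestamps into a per-second count series (the file name hits.txt and the dd.mm.yyyy hh:mm:ss format are assumptions):

```python
import pandas as pd

# Parse the raw timestamps (one per line) into a DataFrame.
ts = pd.read_csv("hits.txt", header=None, names=["timestamp"],
                 parse_dates=["timestamp"], dayfirst=True)

# Hits per second; seconds with no visits become 0.
counts = ts.assign(hits=1).set_index("timestamp")["hits"].resample("1s").sum()

print(counts.describe())
print("Observed maximum hits in one second:", counts.max())
```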
You can always place a more flexible model on the arrival rate parameter. For instance, make the arrival rate a function of time, or place some time-series style model on it. Whatever makes sense for your data. The literature typically focuses on the core model because extensions are application-specific.
In an extended model, you'll almost certainly want to use Bayesian methods. You are interested in the posterior predictive distribution of the object "almost concurrent events." A recent paper in JASA describes nearly your exact problem, applied to call center data:
Bayesian Forecasting of an Inhomogeneous Poisson Process With Applications to Call Center Data
For a quick solution, don't underestimate the power of histogram-style estimators. They are simple nonparametric estimators and you can cross-validate tuning parameters like binwidth and placement. Theoretically this is somewhat unsatisfying, but it would take a day to implement. A fully Bayesian approach will likely dominate, but at significant computational cost.
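If you do go the histogram route, a standard way to pick the binwidth is the leave-one-out cross-validation risk estimate J(h) = 2/((n-1)h) - ((n+1)/((n-1)h)) * sum_j p_j^2, minimized over a grid of candidate binwidths. A hedged sketch, reusing the parsed timestamps from the snippet under the question (the candidate grid is arbitrary):

```python
import numpy as np

# Hour-of-day values, continuing from the `ts` DataFrame built earlier.
hour_of_day = (ts["timestamp"].dt.hour + ts["timestamp"].dt.minute / 60.0).to_numpy()

def cv_risk(x, h, lo=0.0, hi=24.0):
    """Leave-one-out cross-validation risk of a histogram with binwidth h."""
    p = np.histogram(x, bins=np.arange(lo, hi + h, h))[0] / len(x)
    n = len(x)
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p ** 2)

candidate_h = np.linspace(0.05, 2.0, 40)        # binwidths in hours (arbitrary grid)
best_h = min(candidate_h, key=lambda h: cv_risk(hour_of_day, h))
print("Cross-validated binwidth (hours):", best_h)
```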
Well, prepare yourself for a whole lot of 'what's wrong with AWStats/Webalizer/Analog-Stats/favourite-http-log-stats-viewer-of-the-month' responses...
They all do histograms, but that's because they are designed to help give a broad, at-a-glance picture of visitor traffic.
I recommend that you take a look at Splunk to see if it meets your requirements.
If you don't want to use a histogram, could you just graph the kernel density?
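A minimal sketch of that, reusing the hour_of_day values from the previous snippet (the Gaussian kernel and Scott's-rule bandwidth are just the scipy defaults, and the bandwidth is still a tuning choice, much like a bin width):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

kde = gaussian_kde(hour_of_day)     # bandwidth chosen by Scott's rule by default

grid = np.linspace(0, 24, 500)
plt.plot(grid, kde(grid))
plt.xlabel("Hour of day")
plt.ylabel("Estimated visit density")
plt.show()
```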
Can nearly concurrent visits be defined or approximated as those that occur in the same second? If yes, here is how I would proceed:
For each second in the data calculate the number of visits. This will include some seconds with 0 visits - don't exclude them.
It is probably reasonable to assume that the number of visits per second has a Poisson distribution with a rate that changes over the day, and perhaps over the week. So decide what the relevant predictors are (time of day, day of the week, month?) and use Poisson regression to model the counts. You can use splines for the continuous variables (e.g. time of day); I believe there are even some "cyclic" splines that can take into account that 11:58 PM is close to 00:02 AM. Or you can cut time into smaller discrete pieces, say 10-minute intervals. If you want to be really fancy, incorporate autocorrelation and overdispersion in the model.
Based on the fitted model, you can estimate whatever percentile you want.
Of course, this is pretty fancy statistically, and you have to know what you are doing, but I think it could work.
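A hedged sketch of this approach, continuing from the per-second counts series built under the question; the spline degrees of freedom and the example prediction time are assumptions, and patsy's cc() cyclic spline could replace bs() to handle the midnight wrap-around:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import poisson

df = counts.rename("hits").reset_index()
df["hour"] = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60.0
df["weekday"] = df["timestamp"].dt.dayofweek   # only useful if the data spans several days

# Poisson regression of hits/second on a smooth time-of-day effect plus a weekday factor.
model = smf.glm("hits ~ bs(hour, df=6) + C(weekday)",
                data=df, family=sm.families.Poisson()).fit()

# Estimated rate for, say, a Tuesday at 09:00, and a high quantile of the per-second count.
rate = np.asarray(model.predict(pd.DataFrame({"hour": [9.0], "weekday": [1]})))[0]
print("rate:", rate, " 99.9th percentile of hits/second:", poisson.ppf(0.999, rate))
```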
You're right, most of the theory assumes a Poisson distribution of hits, which you don't have because the rate of hits varies with time of day. However, couldn't you stratify your data into, say, one block for each hour of the day and assume that within a single hour the distribution of hits per second/minute/whatever unit is approximately Poisson? There are probably better ways (from a theoretical perspective), but this way has the advantage of being simple to implement and simple to explain to anyone with any statistical background.
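A sketch of that stratified version, again reusing the per-second counts series (the 99.9% quantile is just an example service-level target):

```python
from scipy.stats import poisson

# Mean hits per second within each hour-of-day block (Poisson rate per stratum).
hourly_rate = counts.groupby(counts.index.hour).mean()
print(hourly_rate.sort_values(ascending=False).head())

# Upper bound on near-concurrent visits during the busiest hour.
busiest = hourly_rate.max()
print("99.9th percentile of hits/second in the busiest hour:", poisson.ppf(0.999, busiest))
```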
I think you could argue that your hits are distributed according to a Poisson distribution where the mean and variance vary with the time of day.
To get a good idea of the peak load I'd start with just a scatterplot with the time of the hit on the horizontal axis and the time between that hit and the next hit on the vertical axis.
This should give you a good idea of the height and duration of your peaks. Then you can estimate the parameters of the Poisson distribution for a sliding window of a length similar to that duration, for every moment of the day. Sort of like a moving average. The areas where the mean and variance of the time between hits are lowest will give you a good basis for estimating expected future peak load.
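A sketch of that scatterplot plus the sliding-window estimate, reusing ts and counts from the earlier snippets (the 10-minute window length is an assumption):

```python
import matplotlib.pyplot as plt

# Time of each hit vs. seconds until the next hit.
times = ts["timestamp"].sort_values()
gaps = (times.shift(-1) - times).dt.total_seconds()
plt.scatter(times, gaps, s=2)
plt.xlabel("Time of hit")
plt.ylabel("Seconds until next hit")
plt.show()

# Sliding-window (moving-average style) estimates of the per-second rate and its variance.
rolling = counts.rolling("10min")
print("Peak windowed mean:", rolling.mean().max(),
      " peak windowed variance:", rolling.var().max())
```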
We had an ISP failure for about 10 minutes one day, which unfortunately occurred during a hosted exam that was being written from multiple locations.
Unfortunately, this resulted in the loss of postback data for candidates' current page in progress.
I can reconstruct the flow of events from the server log. However, of 317 candidates, 175 were using a local proxy, which means they all appear to come from the same IP. I've analyzed the data from the remaining 142 (45%), and come up with some good numbers as to what happened with them.
Question: How correct is it to multiply all my numbers by 317/142 to achieve probable results for the entire set? What would be my region of (un)certainty?
Please, no guesses. I need someone who didn't fall asleep in stats class to answer.
EDIT: by numbers, I was referring to counts of affected individuals. For example, 5/142 showed evidence of a browser crash during the session. How correct is the extrapolation to 11/317 having browser crashes?
I'm not sure exactly what measurements we are talking about, but for now let's assume that you want something like the average score. No adjustment is necessary for estimating the mean score of the population (the 317 candidates). Just use the mean of the sample (the 142 whose data you analyzed).
To find your region of uncertainty you can use the formula given in the NIST statistics handbook. You must first decide how uncertain you are willing to be. Let's assume that you want 95% confidence that the true population mean lies within the interval. Then, the confidence interval for the true population mean will be:
(sample mean) +/- 1.960*(sample standard deviation)/sqrt(sample size)
There are further corrections you can make to take credit for having a large sample relative to the population. They will tighten the confidence interval by about 1/4, but there are plenty of assumptions that the above calculation makes that already make it less conservative. One assumption is that the scores are approximately normally distributed. The other assumption is that the sample is representative of the population. You mentioned that the missing data are all from candidates using the same proxy. The subset of the population that used that proxy could be very different from the rest.
EDIT: Since we are talking about a proportion of the sample with an attribute, e.g. "browser crashed", things are a little different. We need to use a confidence interval for a proportion, and convert it back to a number of successes by multiplying by the population size. This means that our best-guess estimate of the number of crashed browsers is 5*317/142 ~= 11 as you suggested.
If we once again ignore the fact that our sample is nearly half of the population, we can use the Wilson confidence interval of a proportion. A calculator is available online to handle the formula for you. The output from the calculator and the formula is upper and lower limits for the fraction in the population. To get a range for the number of crashes, just multiply the upper and lower limits by (population size - sample size) and add back the number of crashes in the sample. While we could simply multiply by the population size to get the interval, that would ignore what we already know about our sample.
Using the procedure above gives a 95% C.I. of 7.6 to 19.0 for the total number of browser crashes in the population of 317, based on 5 crashes in the 142 sample points.
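For reference, a short sketch that reproduces that interval with statsmodels (the finite-population rescaling follows the recipe above):

```python
from statsmodels.stats.proportion import proportion_confint

crashes, sample_n, population_n = 5, 142, 317

# Wilson 95% confidence interval for the crash proportion in the sample.
low, high = proportion_confint(crashes, sample_n, alpha=0.05, method="wilson")

# Scale to the 175 unobserved candidates and add back the 5 observed crashes.
unobserved = population_n - sample_n
print(crashes + low * unobserved, crashes + high * unobserved)   # roughly 7.6 to 19.0
```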