How to aggregate / roll up percentile measures - aggregate-functions

There is a dataset which contains aggregated data - aggregated to various dimensions, and down to the hourly level. The main measure is speed which is simply the file size divided by the duration.
The requirement is to see Percentile, Median and Average/Mean summaries.
Mean is easy: we simply create a calculated measure in the MDX and it then works at all aggregation levels, i.e. daily, monthly, etc.
However, percentile and median are hard. Is there any way to define a calculation for these functions that will roll up correctly? We could add the percentile speed as a column in the ETL when we're reading the raw data, but we'd still need to find a way to roll it up further.
What is the proper way to roll up these types of measures? It's not uncommon to ask for percentile numbers, so I'm surprised not to see much information on this when I look around.
Maybe the only approach is to have various aggregated tables at the right level, with the right calculation, and then make Mondrian use them as agg tables? Or, worst case, have multiple cubes (!)

OK, so it turns out you cannot roll up percentiles (and therefore medians, since a median is just the 50th percentile). I understand others have had this problem; see this tweet from Kasper here: https://twitter.com/kaspersor/status/308189242788560896
So our solution was a couple of different agg tables to store the relevant stats, plus storing the pre-computed percentile and median stats on the main (already aggregated) fact table.
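To see why this is unavoidable, here is a minimal R sketch (the detail table, column names and numbers are made up for illustration) comparing the true median over the hourly detail rows with the median of per-hour medians; the two disagree whenever the hours contain different numbers of rows:

# Minimal sketch: medians (and other percentiles) do not roll up.
# 'detail' is a hypothetical hourly-level table with a 'speed' measure.
set.seed(42)
detail <- data.frame(
  hour  = rep(1:3, times = c(10, 50, 5)),   # uneven row counts per hour
  speed = c(rnorm(10, 100, 5), rnorm(50, 200, 20), rnorm(5, 50, 2))
)

# True daily median, computed from the raw rows
true_median <- median(detail$speed)

# "Rolled up" median: median of the per-hour medians
hourly_medians   <- tapply(detail$speed, detail$hour, median)
rolled_up_median <- median(hourly_medians)

c(true = true_median, rolled_up = rolled_up_median)   # the two values differ

This is why the workable options are to keep the detail rows available or to pre-compute the percentile/median at every grain you need, rather than trying to aggregate them further.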

Related

How to sort many time series by how trending each series is

Hi, I am recording data for around 150k items in InfluxDB. I have tried grouping by item ID and using some of the functions from the docs, but they don't seem to show "trend".
As there are a lot of series to group by, I am currently performing a query on each series to calculate a value, storing it, and sorting by that.
I have tried using linear regression (the average angle of the line), but it's not quite meant for this: the X-axis values are timestamps, which do not correlate with the Y-axis values, so I end up with a near-vertical line. Maybe I can transform the X values into something else?
The other issue I have is that some series have much higher values than others, so one series jumping up by 1000 might be huge (very trending), while it would be no big deal for other series whose values are always much higher.
Is there a way I can generate a single value from a series that represents how much the series is trending, e.g. it has just jumped up quite a lot compared to normal?
Here is an example of one series that is not trending and one that was trending a couple of days ago; the latter would have a higher trend value than the first.
Thanks!
I think similar problems arise naturally in the stock market and in general when detecting outliers.
So there are a few different ways you could go. Probably option 1 is good enough.
1. It looks like you have a moving average in the graphs. You could take the difference between the real series and the moving average and look at its distribution to choose an appropriate threshold for when to pay attention. It looks like the first graph has a potentially relevant event. You could, for example, place the threshold at two standard deviations of that difference (see the sketch after this list).
2. De-trend each series. Even option 1 could be good enough (I mean just subtracting the average of the last X days from the real value of the series), but you could de-trend using more sophisticated ideas. That may need more attention per series; for instance, you should be careful with seasonality and so on. Perhaps something like Hodrick-Prescott, or along the lines of this: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.
3. Perhaps the idea from option 1 is more formally described as Bollinger Bands, which tell you where the time series should sit with some probability.
4. There are more sophisticated ways to identify outliers in time series, as in here: https://towardsdatascience.com/effective-approaches-for-time-series-anomaly-detection-9485b40077f1, or here for a literature review: https://arxiv.org/pdf/2002.04236.pdf
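To make option 1 concrete, here is a rough R sketch of a per-series trend score: the gap between the latest value and the recent moving average, expressed in units of that series' own typical deviation, so that series on very different scales become comparable. The window size and the example data are assumptions, not anything from the question:

# Rough sketch of option 1: score how far the latest value sits above the
# series' recent moving average, in units of that series' own variability.
trend_score <- function(x, window = 24) {
  # x: numeric vector of observations for one series, oldest to newest
  recent    <- tail(x, window)
  baseline  <- head(recent, -1)          # the window excluding the latest point
  mov_avg   <- mean(baseline)
  residuals <- baseline - mov_avg
  spread    <- sd(residuals)
  if (is.na(spread) || spread == 0) return(0)
  (tail(x, 1) - mov_avg) / spread        # "how many deviations above normal"
}

# Example: rank a couple of toy series by how much they are trending right now
series_list <- list(
  flat     = rnorm(100, 10, 1),
  trending = c(rnorm(95, 1000, 50), 1200, 1300, 1450, 1600, 1800)
)
scores <- sapply(series_list, trend_score)
sort(scores, decreasing = TRUE)

Sorting all 150k scores in descending order would then put the most "trending" series first, regardless of each series' absolute level.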

Setting minimum benchmark in variable to increase mean to a certain level

I'm following an Econometrics course in which we use R.
I'm a bit confused about the use of the while-loop for an exercise.
The situation is as follows:
I have a variable 'profit_loss' with a length of 10,000. The variable gives an overview of the financial performance of a company, with its profits (positive values) and losses (negative values). From this I have to create another variable by imposing a minimum benchmark, such that the mean of the new variable becomes 20% higher than the mean of the original variable. In other words, I have to cut the losses at a certain point, which will raise the mean to a level 20% higher than the original mean.
I was thinking of using a while loop, but while I can write some simple loops, I can't seem to figure out this one.
Any suggestions?
Thank you so much!
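In case it helps to see the shape of such a loop, here is one possible sketch (assumptions: profit_loss is a numeric vector with a positive mean, the benchmark acts as a floor on the losses, and the step size and placeholder data are arbitrary):

# Sketch: raise a minimum benchmark (floor) on the losses until the mean of the
# floored variable is 20% above the original mean. 'profit_loss' is placeholder data.
profit_loss <- rnorm(10000, mean = 50, sd = 500)

target    <- 1.2 * mean(profit_loss)     # mean we want to reach (assumes a positive mean)
benchmark <- min(profit_loss)            # start with no effective floor
step      <- 1                           # how much to raise the floor on each pass

capped <- profit_loss
while (mean(capped) < target && benchmark < max(profit_loss)) {
  benchmark <- benchmark + step
  capped    <- pmax(profit_loss, benchmark)   # cut losses below the benchmark
}

c(original_mean = mean(profit_loss), new_mean = mean(capped), benchmark = benchmark)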

How to find outliers in R comparing text to a numerical value?

I am trying to learn R, and finding it difficult to find precisely what I am looking for. There are tons of libraries.
I have a sample data set of 150k first and last names and their salaries.
For fun, I would like to see if any first or last names are associated with significantly higher or lower pay.
,"FirstName","LastName","BasePay"
1,"NATHANIEL","FORD","167411.18"
2,"GARY","JIMENEZ","155966.02"
3,"ALBERT","PARDINI","212739.13"
I have tried using library("arulesViz") and rules <- apriori(data).
But it seems to try to find associations with the precise salary numbers, not with whether the salary is relatively high or low.
Any help on this problem to get me started would be really appreciated!
Regards, Steven
I think it's a perfectly legitimate question.
I would use the package dplyr. You can then use the 'group_by' and 'summarise' functions. In your case, group_by(FirstName) and then choose some statistic, e.g. the mean or median salary, as a measure of bias.
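For example, a sketch along those lines (the file name is hypothetical, and BasePay is converted to numeric first since it is quoted in the CSV):

library(dplyr)

salaries <- read.csv("salaries.csv", stringsAsFactors = FALSE)   # hypothetical file name
salaries$BasePay <- as.numeric(salaries$BasePay)

overall_median <- median(salaries$BasePay, na.rm = TRUE)

# Median pay per first name, compared with the overall median
by_first_name <- salaries %>%
  group_by(FirstName) %>%
  summarise(n = n(), median_pay = median(BasePay, na.rm = TRUE)) %>%
  mutate(diff_from_overall = median_pay - overall_median) %>%
  filter(n >= 30) %>%                 # ignore rare names to reduce noise
  arrange(desc(diff_from_overall))

head(by_first_name)                   # first names associated with the highest pay

The same pattern with group_by(LastName) covers the last names; filtering out rare names keeps a single highly paid person from dominating the ranking.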

Detect patterns in data set

Given a data set with several records that are similar to this one:
I want to detect the green dots. This pattern recurs in a lot of the data records but is not completely identical (sd, variance, min, max, etc. differ). These data points are near the minimum and show low variance.
I tried clustering (kmeans, dbscan, mclust) but the result was not very good.
How can I solve this problem? Any ideas?
Dare I say a simple threshold based on the minimum and a percentage?
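In that spirit, a rough R sketch (the margin, window size, variance cut-off and example data are all arbitrary assumptions, since the actual records aren't shown): flag points that sit close to the series minimum and inside a low-variance stretch.

# Rough sketch of "a threshold based on the minimum and a percentage":
# flag points near the minimum that also lie in a low-variance neighbourhood.
detect_low_flat <- function(y, margin = 0.10, window = 5, var_quantile = 0.25) {
  near_min <- y <= min(y) + margin * diff(range(y))   # within 10% of the range above the minimum
  roll_var <- sapply(seq_along(y), function(i) {      # rolling variance around each point
    idx <- max(1, i - window):min(length(y), i + window)
    var(y[idx])
  })
  low_var <- roll_var <= quantile(roll_var, var_quantile)
  near_min & low_var                                  # TRUE where the "green dots" likely are
}

# Example usage on a made-up series with a flat stretch near its minimum
y <- c(rnorm(40, 10, 2), rnorm(15, 2, 0.1), rnorm(40, 10, 2))
which(detect_low_flat(y))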

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements coming in every 10 minutes; some of them are periodic (e.g. the bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average in order to remove some noise, but then what next? Simple things like standard deviation, MAD, ... against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample "value".
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest in this and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise cancelling headphones, etc. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short term model of the signal source, thereby reducing mean square error output. This then gives you a low level output signal (the residual error) except for when you get an outlier, which will result in a spike, which will be easy to detect (threshold). Read up on adaptive filtering, LMS filters, etc, if you're serious about this kind of technique.
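As a bare-bones illustration of that residual-spike idea, here is a toy normalized-LMS predictor in R (the filter length, step size, burn-in and threshold are all assumptions, not a tuned implementation): the filter learns to predict each sample from the previous few, and samples whose prediction error spikes are flagged.

# Toy LMS adaptive filter: predict each new sample from the previous n samples,
# adapt the weights online, and flag samples whose residual error spikes.
lms_outliers <- function(x, n = 8, mu = 0.5, threshold = 6, burn_in = 50) {
  w      <- rep(0, n)                    # filter coefficients, adapted as data arrives
  errors <- rep(NA_real_, length(x))
  for (t in seq(n + 1, length(x))) {
    u         <- x[(t - 1):(t - n)]      # previous n samples, most recent first
    e         <- x[t] - sum(w * u)       # residual: small once the filter has adapted
    errors[t] <- e
    w         <- w + mu * e * u / (sum(u^2) + 1e-8)   # normalized LMS update
  }
  usable <- errors
  usable[seq_len(burn_in)] <- NA         # ignore the initial adaptation period
  scale <- mad(usable, na.rm = TRUE)     # robust scale of the typical residual
  which(abs(usable) > threshold * scale) # indices whose residual spikes
}

# Example: a slowly varying signal with one injected spike
x <- sin(seq(0, 20, by = 0.1)) * 10 + rnorm(201, 0, 0.3)
x[150] <- x[150] + 15
lms_outliers(x)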
I suggest the scheme below, which should be implementable in a day or so:
Training:
1. Collect as many samples as you can hold in memory.
2. Remove obvious outliers using the standard deviation for each attribute.
3. Calculate and store the correlation matrix and also the mean of each attribute.
4. Calculate and store the Mahalanobis distances of all your samples.
Calculating "outlierness", for the single sample whose "outlierness" you want to know:
1. Retrieve the means, correlation matrix and Mahalanobis distances from training.
2. Calculate the Mahalanobis distance "d" for your sample.
3. Return the percentile in which "d" falls (using the Mahalanobis distances from training).
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
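A compact R sketch of that scheme (the training data here is just a placeholder, and the outlier-trimming step is omitted for brevity; standardizing the columns first means the covariance of the scaled data equals the correlation matrix of the original data, which is what the PS suggests):

# Sketch of the Mahalanobis-based outlier score described above.
# 'train' stands in for a numeric matrix with one row per sample.
set.seed(1)
train <- matrix(rnorm(1000 * 3), ncol = 3)      # placeholder training data

# Training: store means, scales, the correlation matrix and the training distances
mu   <- colMeans(train)
sdev <- apply(train, 2, sd)
train_scaled <- scale(train, center = mu, scale = sdev)
corr     <- cov(train_scaled)                   # == correlation matrix of 'train'
train_d2 <- mahalanobis(train_scaled, center = rep(0, ncol(train)), cov = corr)

# Scoring: percentile of a new sample's distance among the training distances
outlier_score <- function(x) {
  x_scaled <- (x - mu) / sdev
  d2 <- mahalanobis(matrix(x_scaled, nrow = 1), center = rep(0, length(x)), cov = corr)
  mean(train_d2 <= d2)                          # close to 1.0 means an extreme outlier
}

outlier_score(c(0, 0, 0))    # near the centre: low score
outlier_score(c(6, -6, 6))   # far from the centre: score close to 1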
