How does Kusto series_outliers() calculate anomaly scores? - azure-data-explorer

Can someone please explain how the series_outliers() Kusto function calculates anomaly scores? I understand that it uses Tukey fences with a min percentile and a max percentile given a numeric array, but I would like to know in more detail what the steps/algorithm are.
For example, given this table
let T = datatable(val:real)
[
-3, 2.4, 15, 3.9, 5, 6, 4.5, 5.2, 3, 4, 5, 16, 7, 5, 5, 4
]
I found Q1 = 2.4, Q3 = 15, and IQR = 12.6 with a 10%/90% quantile range. So how did it derive these anomaly scores?
[-1.9040785483608571,
-0.10021466044004519,
1.3361954725339347,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
1.6702443406674186,
0.0,
0.0,
0.0,
0.0]

In that function the 10th and 90th percentiles are calculated with linear interpolation, so p10 = 2.7 and p90 = 11, giving IQR = 8.3.
In addition, the score is normalized so that it is comparable to the standard Tukey test (which uses the 25th and 75th percentiles), regardless of the specific percentiles used for calculating the IQR.
The normalization is done by assuming a normal distribution and looking at the score k = 1.5 (the common threshold for mild anomalies) when using p25 and p75. So, when using p10 and p90, the score needs to be multiplied by 2.772 to preserve that k = 1.5 threshold.
Let's see how it works for -3.0, the first point in your sample data: k = (-3 - 2.7)/(11 - 2.7) * 2.772 = -1.904.
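Putting those steps together, here is a small R sketch (my own reconstruction of the description above, not Kusto's actual implementation; 2.772 is the rounded constant quoted above) that reproduces the scores in the question:
val <- c(-3, 2.4, 15, 3.9, 5, 6, 4.5, 5.2, 3, 4, 5, 16, 7, 5, 5, 4)

# 10th and 90th percentiles with linear interpolation (R's default, type = 7)
p10 <- quantile(val, 0.10, type = 7)   # 2.7
p90 <- quantile(val, 0.90, type = 7)   # 11
iqr <- p90 - p10                       # 8.3

# 0 inside the [p10, p90] fence; otherwise the distance from the nearer
# percentile divided by the IQR, rescaled by ~2.772
score <- ifelse(val < p10, (val - p10) / iqr,
                ifelse(val > p90, (val - p90) / iqr, 0)) * 2.772
round(score, 3)
# -1.904 -0.100  1.336  0.000 ...  1.670 ...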
I hope it's clear now.

Related

Clustering values with given threshold

I have several vectors:
a <- c(1.1, 2.9, 3.9, 5.2)
b <- c(1.0, 1.9, 4.0, 5.1)
c <- c(0.9, 2.1, 3.1, 4.1, 5.0, 11.13)
They can have different lengths.
I want to combine them into a single general vector that contains averaged values wherever similar values occur in all of the vectors, or in any pair of them, and the original value wherever it occurs in only one vector. For the averaging I would like to use a threshold of 0.2.
My explanation could be a bit confusing, but here is the general vector I want to obtain:
d <- c(1, 2, 3, 4, 5.1, 11.13)
I have around 12 vectors and about 2000 values in each vector.
I will be glad for any help.
This seems like a clustering problem (clustering by distance). You can try the code below:
library(igraph)

# pool and sort all the values
v <- sort(c(a, b, c))

# link values that lie within the 0.2 threshold of each other (with a small
# floating-point tolerance), then average each connected component
adj <- as.matrix(dist(v)) <= 0.2 + sqrt(.Machine$double.eps)
tapply(v, membership(components(graph_from_adjacency_matrix(adj))), mean)
which gives
1 2 3 4 5 6
1.00 2.00 3.00 4.00 5.10 11.13
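For one-dimensional data like this, the same result can also be sketched in base R without igraph: after sorting, a new cluster starts wherever the gap to the previous value exceeds the threshold (keeping the same floating-point tolerance):
v <- sort(c(a, b, c))
cluster <- cumsum(c(TRUE, diff(v) > 0.2 + sqrt(.Machine$double.eps)))
tapply(v, cluster, mean)
#     1     2     3     4     5     6
#  1.00  2.00  3.00  4.00  5.10 11.13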

Normal Distribution bayesAB - negative mean values

A_norm <- rnorm(300, 5, 2)
B_norm <- rnorm(300, -5, 2)
AB2 <- bayesTest(A_norm, B_norm,
                 priors = c('mu' = 5, 'lambda' = 1, 'alpha' = 3, 'beta' = 1),
                 distribution = 'normal')
plot(AB2)
summary(AB2)
Using the above code, summary(AB2) shows that the probability of A > B is equal to 0. However, when I look at the posterior plots of mu for A and B (plot(AB2)), the distribution of A appears to be always greater than that of B, with no overlap. I think it has to do with comparing two normal distributions with a positive and a negative mean respectively, but I can't understand why this would affect it. Can someone help me understand?

Calculate probability table with exponentially increasing dimensions

The goal here is to calculate the expected outcomes and scoring table of a game.
The game outcomes can be described in terms of a weighted dice roll:
We can set the input weights and bonuses
Then we roll N times
Then we apply score modifiers to the outcome table
Then we calculate the expected score of the set of N rolls.
[roll: 1, weight: 0.24, score: 2],
[roll: 2, weight: 0.11, score: 4],
…
[roll: 19, weight: 0.05, score: 7],
[roll: 20, weight: 0.03, score: 20],
example bonus: min_set_score = rolls_per_set * 6.2
If we roll the die twice, we can manually generate the result table:
[rolls: [1,1], weight: 0.24 * 0.24, score: 2 + 2]
[rolls: [1,2], weight: 0.24 * 0.11, score: 2 + 4]
…etc
then, because min_set_score = 2 * 6.2 = 12.4, the scores for [1,1] and [1,2] would actually be adjusted up to 12.4.
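As an illustration of that 2-roll enumeration, here is a small R sketch using a hypothetical 3-sided die (the full 20-entry weight/score table is not given above), showing the weights multiplying, the scores adding, and the bonus floor being applied:
# hypothetical 3-sided die; the real game uses the 20-entry table above
die <- data.frame(roll = 1:3, weight = c(0.5, 0.3, 0.2), score = c(2, 4, 7))

rolls_per_set <- 2
min_set_score <- rolls_per_set * 6.2                       # 12.4

sets <- expand.grid(r1 = die$roll, r2 = die$roll)          # every ordered pair of rolls
sets$weight <- die$weight[sets$r1] * die$weight[sets$r2]   # weights multiply
sets$score  <- pmax(die$score[sets$r1] + die$score[sets$r2], min_set_score)  # scores add, then the floor

sum(sets$weight * sets$score)                              # expected score of a 2-roll set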
How would we calculate the result table for 50, 100, or 500 rolls? 50 rolls produces 20^50 outcomes and can no longer be handled by our naive implementation in a reasonable time frame.
Is this something that can be done with a better library, or with a more suitable language or technology?
This does not need to be done in real time, we would like to use the results to inform our decisions about how the game elements should affect the weights and bonuses.

Poorly implemented two-sample Kolmogorov-Smirnov test (kstest2) in Matlab?

Am I missing something obvious, or is Matlab's kstest2 giving very poor p-values?
By very poor I mean that I suspect it may even be wrongly implemented.
The help page of kstest2 states that the function calculates an asymptotic p-value, though I did not find any reference about exactly which method is used. Anyway, the description further states:
The asymptotic p-value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2, such that (n1*n2)/(n1 + n2) ≥ 4
Example 1
Let's take Example 6 from Lehmann and D'Abrera (1975):
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
[h,p,ks2stat] = kstest2(sampleA, sampleB, 'Tail', 'unequal');
(n1*n2)/(n1 + n2) = 4 in this case, so the p-value should be reasonably accurate.
Matlab yields p = 0.0497, while the solution given in the book is 0.0870.
To validate the solution I used R, which I trust more than Matlab, especially in statistics.
Using ks.test from stats package and ks.boot from Matching package:
ks.test(sampleA, sampleB, alternative = "two.sided")
ks.boot(sampleA, sampleB, alternative = "two.sided")
Both give p = 0.0870.
Example 2
Let's use kstest2's own example to compare the Matlab and R results for a larger sample size:
rng(1); % For reproducibility
x1 = wblrnd(1,1,1,50);
x2 = wblrnd(1.2,2,1,50);
[h,p,ks2stat] = kstest2(x1,x2);
This yields p = 0.0317. Now, using the same x1 and x2 vectors, R gives p = 0.03968.
That is about a 20% difference when a very accurate result is expected, since (n1*n2)/(n1 + n2) = 25.
Am I missing or messing up something?
Is it possible that Matlab's kstest2 performs as poorly as these examples indicate? What approximation or algorithm is kstest2 using? (I can see the implemented code for kstest2, but a reference to a book or paper would be much better for understanding what is going on.)
I am using Matlab 2016a.
Lehmann and D'Abrera (1975). Nonparametrics: Statistical Methods Based on Ranks. 1st edition. Springer.
I think that the correct test to compare with R's ks.test in MATLAB or Octave would be kolmogorov_smirnov_test_2:
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
kolmogorov_smirnov_test_2(sampleA, sampleB)
pval: 0.0878664
The difference appears to be the use of ks versus lambda, i.e.
ks = sqrt (n) * d;
pval = 1 - kolmogorov_smirnov_cdf (ks);
versus
lambda = max((sqrt(n) + 0.12 + 0.11/sqrt(n)) * d , 0);
pval = 1 - kolmogorov_smirnov_cdf (lambda);
I presume the different test statistics arise from the differences in the research papers cited by these two functions. If you want a deeper dive into the statistical theory, you may want to ask on CrossValidated.
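To make the difference concrete, here is an R sketch (my own reconstruction using the asymptotic Kolmogorov distribution, not the code of either implementation) that reproduces both p-values for the two samples above:
sampleA <- c(6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9)
sampleB <- c(4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3)

# two-sample KS statistic: largest absolute distance between the two ECDFs
pooled <- sort(c(sampleA, sampleB))
d <- max(abs(ecdf(sampleA)(pooled) - ecdf(sampleB)(pooled)))                  # 0.625
n <- length(sampleA) * length(sampleB) / (length(sampleA) + length(sampleB))  # 4

# asymptotic Kolmogorov survival function: Q(x) = 2 * sum_k (-1)^(k-1) * exp(-2 * k^2 * x^2)
Q <- function(x, k = 1:100) 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))

Q(sqrt(n) * d)                             # ~0.0879, the ks statistic (Octave / ks.test style)
Q((sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d)   # ~0.0497, the lambda statistic (kstest2 style)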

How to exactly match the result of cumulative distribution function and quantile function?

As we know, the quantile function is the inverse of the cumulative distribution function.
Then, for an existing distribution (a vector), how can I exactly match the results of the cumulative distribution function and the quantile function?
Here is an example given in MATLAB.
a = [150 154 151 153 124];
[x_count, x_val] = hist(a, unique(a));
% compute the cumulative distribution probabilities
p = cumsum(x_count)/sum(x_count);
x_out = quantile(a, p)
In the cumulative distribution function, the corresponding relation between cumulative probability and x value should be:
x = 124 150 151 153 154
p = 0.2000 0.4000 0.6000 0.8000 1.0000
But when using p and quantile to compute x_out, the result is different from x:
x_out =
137.0000 150.5000 152.0000 153.5000 154.0000
Reference
quantile function
matlab quantile function
From the docs:
For a data vector of five elements such as {6, 3, 2, 10, 1}, the sorted elements {1, 2, 3, 6, 10} respectively correspond to the 0.1, 0.3, 0.5, 0.7, 0.9 quantiles.
So if you want to get the exact numbers out that you put in for x, and your x has 5 elements, then your p needs to be p = [0.1, 0.3, 0.5, 0.7, 0.9]. The complete algorithm is explicitly defined in the documentation.
You have assumed that to get x back, p should have been [0.2, 0.4, 0.6, 0.8, 1]. But then why not p = [0, 0.2, 0.4, 0.6, 0.8]? Matlab's algorithm seems to just take a linear average of the two methods.
Note that R defines nine different algorithms for quantiles, so your assumptions need to be stated clearly.
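To check this in R (assuming I have the mapping right, MATLAB's (k - 0.5)/n convention corresponds to R's type = 5 quantile algorithm):
a <- c(150, 154, 151, 153, 124)

p_matlab <- (seq_along(a) - 0.5) / length(a)   # 0.1 0.3 0.5 0.7 0.9
quantile(a, p_matlab, type = 5)                # 124 150 151 153 154  -> the sorted data
quantile(a, seq(0.2, 1, by = 0.2), type = 5)   # 137 150.5 152 153.5 154  -> the x_out above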
