How does Sum Square aggregation work, and how can it be used to calculate the standard deviation in icCube?

One of the aggregation types in icCube is "Sum Square".
I thought it would calculate the sum of squares as defined in the link, but I get different results.
What exactly is the Sum Square aggregation and how can I use it to calculate the standard deviation in icCube?

You can check the algorithm for calculating the variance and standard deviation in this Wikipedia article.
You need the count of rows (n), the sum of the values (Sum), and the sum of squares (SumSq):
Var = (SumSq − (Sum × Sum) / n) / (n − 1)
The standard deviation is the square root of the variance (sqrt).
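As a quick sanity check, here is a minimal Python sketch of that calculation (the aggregated values below are made up for illustration):

import math

def stddev_from_aggregates(n, total, sum_sq):
    # Sample variance from count, sum, and sum of squares, then its square root
    variance = (sum_sq - (total * total) / n) / (n - 1)
    return math.sqrt(variance)

# Hypothetical aggregates for the rows 2, 4, 4, 4, 5, 5, 7, 9
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)                       # 8
total = sum(values)                   # Sum   = 40
sum_sq = sum(v * v for v in values)   # SumSq = 232
print(stddev_from_aggregates(n, total, sum_sq))  # ~2.138 (sample std dev)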

Related

Optimum number of permutations to use for estimating set similarity using min hash

Let's say I have to estimate the Jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures.
How should I set my k value? Since setting it to a really high value would increase computation time significantly, what is the smallest value of k that can still give me a good Jaccard index estimate?
Given an error tolerance e > 0 and a failure probability delta, how can I determine the minimum value of k such that the Jaccard index lies between (1 − e)·jaccard_estimate and (1 + e)·jaccard_estimate with probability at least (1 − delta)?
I believe this can be derived using a Chernoff bound, but I'm unable to figure out how to go about it. Any help would be appreciated. Thanks in advance!
If J' denotes the estimate and J is the true Jaccard similarity, k·J' follows a binomial distribution with parameters k and J. As a consequence, the variance of the estimate is Var(J') = J(1 − J)/k <= 1/(4k). Therefore the standard deviation of the estimate is bounded by stdev(J') <= 1/(2·sqrt(k)).
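One way to turn that into a concrete k (a sketch, not part of the answer above) is Hoeffding's inequality for an average of k independent 0/1 indicators: P(|J' − J| >= e) <= 2·exp(−2·k·e²), which bounds the absolute error; for the relative-error guarantee asked about, you would replace e with e·J and therefore need a lower bound on J.

import math

def min_permutations(e, delta):
    # Smallest k with 2 * exp(-2 * k * e^2) <= delta  (Hoeffding, absolute error e)
    return math.ceil(math.log(2.0 / delta) / (2.0 * e * e))

# Example: absolute error 0.05 with probability at least 0.95
print(min_permutations(0.05, 0.05))  # 738 permutations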

Formula of computing the Gini Coefficient in fastgini

I use the fastgini package for Stata (https://ideas.repec.org/c/boc/bocode/s456814.html).
I am familiar with the classical formula for the Gini coefficient reported for example in Karagiannis & Kovacevic (2000) (http://onlinelibrary.wiley.com/doi/10.1111/1468-0084.00163/abstract)
Formula I:
G = Σ_i Σ_j |y_i − y_j| / (2 × N² × µ)
Here G is the Gini coefficient, µ the mean value of the distribution, N the sample size, and y_i the income of the ith sample unit. Hence, the Gini coefficient takes all available income pairs in the data and totals their absolute differences.
This total is then normalized by dividing it by population squared times mean income (and multiplied by two?).
The Gini coefficient ranges between 0 and 1, where 0 means perfect equality (all individuals earn the same) and 1 refers to maximum inequality (1 person earns all the income in the country).
However, the fastgini package refers to a different formula (http://fmwww.bc.edu/repec/bocode/f/fastgini.html):
Formula II:
fastgini uses formula:
G = 1 - 2 * [ SUM_{i=1..N} W_i * ( SUM_{j=1..i} W_j * X_j - W_i * X_i / 2 ) ] / [ ( SUM_{i=1..N} W_i * X_i ) * ( SUM_{i=1..N} W_i ) ]
where observations are sorted in ascending order of X.
Here W seems to be the weight; since I don't use weights, it should be 1 (?). I'm not sure whether formula I and formula II are the same: there are no absolute differences, and the result is subtracted from 1 in formula II. I have tried to transform the equations, but I don't get any further.
Could someone give me a hint whether both ways of computing (formula I + formula II) are equivalent?
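Not part of the original thread, but a quick numerical check (a Python sketch, assuming unit weights) makes it easy to convince yourself that the two formulas agree:

import random

def gini_classical(y):
    # Formula I: sum of all absolute pairwise differences over 2 * N^2 * mean
    n = len(y)
    mu = sum(y) / n
    total = sum(abs(a - b) for a in y for b in y)
    return total / (2 * n * n * mu)

def gini_fastgini(y):
    # Formula II with all weights W_i = 1 and data sorted in ascending order
    x = sorted(y)
    n = len(x)
    running = 0.0   # SUM_{j<=i} X_j
    num = 0.0
    for xi in x:
        running += xi
        num += running - xi / 2.0
    return 1 - 2 * num / (sum(x) * n)

y = [random.lognormvariate(0, 1) for _ in range(500)]
print(gini_classical(y), gini_fastgini(y))  # the two values agree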

Partial sums of harmonic series

Formula: H_n = 1 + 1/2 + 1/3 + ... + 1/n
I was told by my math teacher that it is impossible to calculate, from the formula above, the n that is necessary for the sum to exceed 40 (sum > 40), and to know that sum to 50 decimal places.
(In short: the first n necessary for sum > 40, and what that sum would be to 50 decimal places.)
I tried writing a C++ program for this, but realized after a ton of optimizations that it would take just way too long.
H_n is bounded below by ln n + gamma, where gamma is the Euler-Mascheroni constant (http://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant). So you can start by finding n such that ln n + gamma = 40. Solving, you get ln n = 40 - gamma, n = e^(40 - gamma), which is quite straightforward to calculate. Once you know the ballpark, you can use a binary search and more accurate over- and under-estimates for H_n (see the asymptotic expansion at http://en.wikipedia.org/wiki/Harmonic_number#Calculation; there are many references that provide more detail).
Why would that be impossible? It's
40.00000000000000000202186036912232961108532260403356
Steps to get there:
Ask Wolfram Alpha for the number n where the sum equals 40.
You'll get something around 1.32159290357566702732792368 × 10^17. Pick the next higher integer.
Compute the sum for n = 132159290357566703.
Click on "More digits" until satisfied.
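For an offline alternative, here is a Python sketch using the mpmath library (an assumption, not part of either answer); it follows the same two steps and relies on the fact that H_n can be evaluated through the digamma function instead of summing ~10^17 terms:

from mpmath import mp, exp, euler, harmonic

mp.dps = 60  # a few guard digits beyond the 50 requested

# Step 1: ballpark from ln n + gamma = 40, i.e. n ~ e^(40 - gamma)
n = int(exp(40 - euler))

# Step 2: harmonic(n) is evaluated via the digamma function, so it is cheap
# even for n ~ 1.3e17; nudge n to the first integer with H_n > 40
while harmonic(n) <= 40:
    n += 1
while harmonic(n - 1) > 40:
    n -= 1

print(n)            # should print 132159290357566703, as above
print(harmonic(n))  # 40.00000000000000000202186..., matching the value above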

Difference between Hmisc wtd.var and SAS PROC MEANS generated weighted variance

I'm getting different results from R and SAS when I try to calculate a weighted variance. Does anyone know what might be causing this difference?
I create vectors of weights and values, and then calculate the weighted variance using the Hmisc library's wtd.var function:
library(Hmisc)
wt <- c(5, 5, 4, 1)
x <- c(3.7,3.3,3.5,2.8)
wtd.var(x,weights=wt)
I get an answer of:
[1] 0.0612381
But if I try to reproduce these results in SAS I get a quite different result:
data test;
input wt x;
cards;
5 3.7
5 3.3
4 3.5
1 2.8
;
run;
proc means data=test var;
var x;
weight wt;
run;
Results in an answer of
0.2857778
You probably have a difference in how the variance is calculated. SAS gives you an option, VARDEF, which may help here.
proc means data=test var vardef=WDF;
var x;
weight wt;
run;
On your dataset that gives a variance similar to R's. Both are 'right', depending on how you choose to calculate the weighted variance. (At my shop we calculate it a third way, of course...)
Complete text from PROC MEANS documentation:
VARDEF=divisor specifies the divisor to use in the calculation of the
variance and standard deviation. The following table shows the
possible values for divisor and associated divisors.
Possible Values for VARDEF=
Value          Divisor                     Formula for Divisor
DF             degrees of freedom          n - 1
N              number of observations      n
WDF            sum of weights minus one    (Σ_i w_i) - 1
WEIGHT | WGT   sum of weights              Σ_i w_i
The procedure computes the variance as CSS/Divisor, where CSS
is the corrected sums of squares and equals Sum((Xi-Xbar)^2). When you
weight the analysis variables, CSS equals sum(Wi*(Xi-Xwbar)^2), where
Xwbar is the weighted mean.
Default: DF
Requirement: To compute the standard error of the mean, confidence limits for the mean, or the Student's t-test, use the default value of VARDEF=.
Tip: When you use the WEIGHT statement and
VARDEF=DF, the variance is an estimate of Sigma^2, where the
variance of the ith observation is Sigma^2/wi and wi is the
weight for the ith observation. This method yields an estimate of the
variance of an observation with unit weight.
Tip: When you use the
WEIGHT statement and VARDEF=WGT, the computed variance is
asymptotically (for large n) an estimate of Sigma^2/wbar, where
wbar is the average weight. This method yields an asymptotic
estimate of the variance of an observation with average weight.
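To see concretely where the two numbers come from, here is a small Python sketch of the formulas above (not using either tool); the WDF-style divisor reproduces the Hmisc value and the DF divisor reproduces the SAS default:

wt = [5, 5, 4, 1]
x = [3.7, 3.3, 3.5, 2.8]

sw = sum(wt)                                             # sum of weights = 15
xwbar = sum(w * v for w, v in zip(wt, x)) / sw           # weighted mean
css = sum(w * (v - xwbar) ** 2 for w, v in zip(wt, x))   # corrected sum of squares

print(css / (sw - 1))      # 0.0612381  -> Hmisc wtd.var / VARDEF=WDF
print(css / (len(x) - 1))  # 0.2857778  -> SAS default, VARDEF=DF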

Calculate weighted mean point from array of 3D points

I am writing the program in Cocoa but I think that the solution must be quite universal.
I have a set of points represented by 3D vectors. Each point has a weight assigned to it. The weight is in the range from 0 to 1. The sum of all weights isn't equal to 1.
How should the weighted mean point be calculated from such set?
Either a programmatic or a purely mathematical solution would be helpful.
Of course, if Cocoa has some specific tools for solving this task, I would very much appreciate that information.
Simply sum all vectors scaled by their weight, then divide by the sum of all weights. This has the same effect as first normalizing all weights to sum to 1.
Pseudo-code:
sum = [0, 0, 0]
totalWeights = 0
for each point p with associated weight w:
    sum += p * w
    totalWeights += w
mean = sum / totalWeights
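Not Cocoa-specific, but here is a runnable Python sketch of the pseudo-code above, with points as (x, y, z) tuples:

def weighted_mean_point(points, weights):
    # points: iterable of (x, y, z) tuples; weights: matching iterable of floats
    sx = sy = sz = 0.0
    total_weight = 0.0
    for (x, y, z), w in zip(points, weights):
        sx += x * w
        sy += y * w
        sz += z * w
        total_weight += w
    return (sx / total_weight, sy / total_weight, sz / total_weight)

points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
weights = [0.5, 0.3, 0.2]
print(weighted_mean_point(points, weights))  # (0.5, 0.6, 0.6)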
