Formula for computing the Gini coefficient in fastgini

I use the fastgini package for Stata (https://ideas.repec.org/c/boc/bocode/s456814.html).
I am familiar with the classical formula for the Gini coefficient reported for example in Karagiannis & Kovacevic (2000) (http://onlinelibrary.wiley.com/doi/10.1111/1468-0084.00163/abstract)
Formula I:

G = (1 / (2 * N^2 * µ)) * SUM_{i=1}^{N} SUM_{j=1}^{N} |y_i - y_j|

Here G is the Gini coefficient, µ the mean value of the distribution, N the sample size and y_i the income of the ith sample unit. Hence, the Gini coefficient takes all available income pairs in the data and totals their absolute differences.
This total is then normalized by dividing it by two times the squared population size times the mean income.
The Gini coefficient ranges between 0 and 1, where 0 means perfect equality (all individuals earn the same) and 1 refers to maximum inequality (1 person earns all the income in the country).
However the fastgini package refers to a different formula (http://fmwww.bc.edu/repec/bocode/f/fastgini.html):
Formula II:
fastgini uses formula:
            i=N      j=i
            SUM W_i*(SUM W_j*X_j - W_i*X_i/2)
            i=1      j=1
G = 1 - 2 * ---------------------------------
                i=N           i=N
                SUM W_i*X_i * SUM W_i
                i=1           i=1
where observations are sorted in ascending order of X.
Here W seems to be the weight; since I don't use weights, it should be 1 everywhere. I am not sure whether formula I and formula II are the same: formula II contains no absolute differences, and its result is subtracted from 1. I have tried to transform one equation into the other, but I do not get any further.
Could someone give me a hint whether both ways of computing (formula I + formula II) are equivalent?
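As a quick numerical sanity check (not a proof), the two formulas can be compared in R with all weights W_i set to 1 and some made-up incomes:

set.seed(1)
y  <- rexp(200)            # hypothetical income data
N  <- length(y)
mu <- mean(y)

# Formula I: total of all absolute pairwise differences, normalized
g1 <- sum(abs(outer(y, y, "-"))) / (2 * N^2 * mu)

# Formula II with all weights W_i = 1, observations sorted ascending
x  <- sort(y)
g2 <- 1 - 2 * sum(cumsum(x) - x / 2) / (sum(x) * N)

all.equal(g1, g2)          # TRUE: the two formulas coincide for unit weights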

Related

Generate random data based on correlation matrix for multiple timesteps in R

I would like to simulate data for some cases (e.g. nPerson = 1000 observations) at
some consecutive timesteps (e.g. ts = 3) for N intercorrelated variables (e.g. N = 5).
The simulation should be based on a correlation matrix (corrMat, with nrow = N and
ncol = N, since it describes the correlations among the N variables).
corrMat should be identical for all timesteps.
I already found out that the MASS package has the function mvrnorm to create
random data fitting the constraints given by corrMat:
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)
Now I would like to simulate t2 as a function of t1 and corrMat.
The data of t2 therefore should correlate according to corrMat
and they should also have same variance as the variables of t1.
One important constraint: for the initial values corrMat[i,i] = 1;
for consecutive timesteps it should be possible that corrMat[i,i] < 1,
because each variable depends on itself one timestep before,
but a perfect correlation is not intended.
Maybe there is a variance decomposition of the correlation matrix
that yields an error variance for each of the N variables at the
next timestep. One could then calculate the values at timestep t+1
as the sum of the weighted correlations of the variables at timestep t,
and then add a random error distributed according to that error
variance (with mean 0), so that the correlation matrix is reproduced
again at t+1.
Assuming normal errors:
getRand <- function(sd) {
  # one normal error draw with the given standard deviation
  rnorm(1, mean = 0, sd = sd)
}
This is the (very simplified) pseudo-code for the i-th variable x_i:
x_i[t+1] <- 0
for (j in 1:N) {
  x_i[t+1] <- x_i[t+1] + corrMat[i, j] * x_j[t]
}
x_i[t+1] <- x_i[t+1] + getRand(sdErr)
So the question is more specific: how do I calculate sdErr?
For simplicity I assume that the variance of all variables should be 1.
Thank you for any hint on how to get one step further!
I will post a mathematical formulation of the problem to stats.stackexchange.com,
as mikeck suggested, to discuss the details of the correlation problem in more depth.
I am still interested in finding a general formula to calculate sdErr
for use in the calculation of x_i[t+1].
But meanwhile I found a useful practical solution to the specific question "how to calculate sdErr?" that works without a closed formula for sdErr (see the sketch after this list):
(1) simply calculate all variables WITHOUT errors (according to the equation above);
(2) calculate the variances of the new variables;
(3) calculate (for each i) the difference var(x_i[t]) - var(x_i[t+1]) = sdErr^2.
Noise with this sdErr can then be added to each variable for each new observation.
This should lead to observations at t+1 which at least have the same variances as the observations at t.
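Here is a minimal R sketch of steps (1)-(3). The matrix lagMat of cross-lagged weights (with diagonal < 1, as described above) is made up purely for illustration; only the three-step recipe itself comes from the post:

library(MASS)
set.seed(42)

nPerson <- 1000; N <- 5
corrMat <- matrix(0.3, N, N); diag(corrMat) <- 1    # correlations at t1 (hypothetical)
lagMat  <- matrix(0.1, N, N); diag(lagMat)  <- 0.5  # weights linking t to t+1 (hypothetical)

t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)

# (1) all variables WITHOUT errors: x_i[t+1] = sum_j lagMat[i, j] * x_j[t]
t2det <- t1 %*% t(lagMat)

# (2) + (3) per-variable error sd that restores the variance of timestep t
sdErr <- sqrt(apply(t1, 2, var) - apply(t2det, 2, var))

# add independent noise with that sd to every new observation
t2 <- t2det + sapply(sdErr, function(s) rnorm(nPerson, mean = 0, sd = s))

round(apply(t2, 2, var), 2)   # variances are back near 1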
Details concerning the question of whether the model definition is adequate
will be part of another post.

Matrix dimensions do not match in regression formula

I'm trying to calculate a regression formula, but I have a problem with the dimension calculation; the dimensions are not correct. In the formula:
X - a matrix with dimensions 200x20, n = 200 samples, p = 20 predictors,
y - a vector with dimensions 200x1,
beta^(k) - a sequence of coefficient vectors, each with dimensions 20x1, for k = 1, 2, 3, ...,
X^T - dimensions 20x200,
j - an index from 1...p, so from 1...20.
The problem arises when I calculate the term y - X beta^(k-1).
For example, for k = 20, k - 1 = 19, I read beta^(19) as the 19th entry of beta, and then the dimensions do not match for the subtraction: 200x1 - (200x20)(1x1) = 200x1 - 200x20 will not work.
If I take the whole beta vector then it is correct. So does beta^(19) mean to take the 19th value of beta and multiply it with the matrix X?
You should be using the entire beta vector at each stage of the calculation.
(Tibshirani has been a bit permissive with his use of notation, perhaps...)
The k is just a counter for which step of the algorithm we are on. Right at the start (k = 0, or "step 0") we initialise the entire beta vector to have all elements equal to zero:
beta^(0) = (0, 0, ..., 0)^T
At each step of the algorithm (steps k = 1, 2, 3... and so on) we use our previous estimate of the vector beta (beta^(k-1), calculated in step k - 1) to calculate a new improved estimate for the vector beta (beta^(k)). The superscript number is not an index into the vector; rather, it is a label telling us at which stage of the algorithm that beta vector was produced.
I hope this makes sense. The important point is that each of the beta^(k) values is a different 20x1 vector.
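To make the dimension bookkeeping concrete, a small R illustration with random data (it only demonstrates the shapes, not the actual algorithm):

set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)   # 200x20
y <- rnorm(n)                     # 200x1

beta <- rep(0, p)                 # beta^(0): the whole 20x1 vector, step 0

r <- y - X %*% beta               # 200x1 - (200x20)(20x1) = 200x1: dimensions match
dim(X %*% beta)                   # 200 1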

How to simulate from a Poisson distribution using simulations from an exponential distribution

I am asked to implement an algorithm to simulate from a Poisson(lambda) distribution using simulation from an exponential distribution.
I was given the following identity:
P(X = k) = P(X_1 + · · · + X_k ≤ 1 < X_1 + · · · + X_{k+1}), for k = 1, 2, . . .
Here X is Poisson(lambda) distributed, and the X_i are independent Exponential(lambda) random variables.
I wrote code to simulate the exponential distribution, but have no clue how to simulate a Poisson. Could anybody help me with this? Thanks a million.
My code:
u <- runif(k)                # k uniform draws
x <- -log(1 - u) / lambda    # k draws from Exponential(lambda) by inversion
I'm working on the assumption that you (or your instructor) want to do this from first principles rather than just calling the built-in Poisson generator. The algorithm is pretty straightforward: you count how many exponentials with the specified rate you can generate before their sum exceeds 1.
My R is rusty and this sounds like homework anyway, so I'll express it as pseudo-code:
count <- 0
sum <- 0
repeat {
  generate x ~ exp(lambda)
  sum <- sum + x
  if sum > 1
    break
  else
    count <- count + 1
}
The value of count after you break from the loop is your Poisson outcome for this trial. If you wrap this as a function, return count rather than breaking from the loop.
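For concreteness, here is one possible R translation of that pseudo-code (the function name rpois1 is made up for this sketch):

rpois1 <- function(lambda) {
  count <- 0
  total <- 0
  repeat {
    total <- total + rexp(1, rate = lambda)  # one exponential inter-arrival time
    if (total > 1) return(count)             # sum exceeded 1: count is the outcome
    count <- count + 1
  }
}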
You can improve this computationally in a couple of ways. The first is to notice that the 1-U term for generating the exponentials has a uniform distribution, and can be replaced by just U. The more significant improvement is obtained by writing the evaluation as maximize i s.t. SUM(-log(Ui) / rate) <= 1, so SUM(log(Ui)) >= -rate.
Now exponentiate both sides and simplify to get
PRODUCT(Ui) >= Exp(-rate).
The right-hand side of this is constant, and can be pre-calculated, reducing the amount of work from k+1 log evaluations and additions to one exponentiation and k+1 multiplications:
count <- 0
product <- 1
threshold <- Exp(-lambda)
repeat {
  generate u ~ Uniform(0,1)
  product <- product * u
  if product < threshold
    break
  else
    count <- count + 1
}
Assuming you do the U for 1-U substitution for both implementations, they are algebraically equal and will yield identical answers to within the precision of floating point arithmetic for a given set of U's.
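In R, the product form might look like this (again a sketch; rpois_prod is a made-up name, and the threshold is computed once per call):

rpois_prod <- function(lambda) {
  count <- 0
  product <- 1
  threshold <- exp(-lambda)          # constant right-hand side, pre-calculated
  repeat {
    product <- product * runif(1)    # multiply in the next uniform
    if (product < threshold) return(count)
    count <- count + 1
  }
}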
You can use rpois to generate Poisson variates, as suggested above. However, my understanding of the question is that you wish to do so from first principles rather than using built-in functions. To do this, you need to use the property of Poisson arrivals stating that the inter-arrival times are exponentially distributed. Therefore we proceed as follows:
Step 1: Generate a (large) sample from the exponential distribution and create a vector of its cumulative sums. The k-th entry of this vector is the waiting time until the k-th Poisson arrival.
Step 2: Count how many arrivals we see in a unit time interval.
Step 3: Repeat steps 1 and 2 many times and gather the results into a vector.
This will be your sample from the Poisson distribution with the correct rate parameter.
The code:
lambda <- 20   # for example
out <- sapply(1:100000, function(i) {
  u <- runif(100)              # uniform draws
  x <- -log(1 - u) / lambda    # exponential inter-arrival times
  y <- cumsum(x)               # arrival times
  length(which(y <= 1))        # number of arrivals within the unit interval
})
Then you can test the validity vs the built-in function via the Kolmogorov-Smirnov test:
ks.test(out, rpois(100000, lambda))

Cox Regression Hazard Ratio in Percentiles

I computed a Cox proportional hazards regression in R.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I got the hazard ratio (HR, or exp(coef)) for all these covariates, but I'm really only interested in the effect of the continuous predictor X. The HR for X is 1.20. X is scaled to the sample measurements, such that X has a mean of 0 and an SD of 1. That is, an individual with a 1 SD increase in X has a 1.20 times higher chance of mortality (the event) than someone with an average value of X (I believe).
I would like to state these results in a way that's a bit less awkward, and this article does exactly what I would like. It says:
"In a Cox proportional hazards model adjusting for age, sex and
education, a higher level of total daily physical activity was
associated with a decreased risk of death (hazard ratio=0.71;
95%CI:0.63, 0.79). Thus, an individual with high total daily physical
activity (90th percentile) had about ¼ the risk of death as compared
to an individual with low total daily physical activity (10th
percentile)."
Assuming only the HR (i.e. 1.20) is needed, how does one compute this comparison statement? If you need any other information, please ask me for it.
If x1 is your 90th-percentile X value and x2 is your 10th-percentile X value, and if p, q, r and s are your Cox regression coefficients (s belongs to X, so exp(s) = 1.20, the HR you mention), then you need to find exp(p*A + q*B + r*C + s*x1) / exp(p*A + q*B + r*C + s*x2), where A, B and C can be average values of those variables. Everything except the X terms cancels, so the ratio reduces to exp(s*(x1 - x2)) = 1.20^(x1 - x2). This ratio gives you the comparison statement.
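As a rough sketch in R, using the df and standardised X from the question (quantile() pulls the empirical percentiles):

hr  <- 1.20                      # hazard ratio per 1 SD increase in X
x90 <- quantile(df$X, 0.90)      # 90th percentile of X
x10 <- quantile(df$X, 0.10)      # 10th percentile of X
hr ^ (x90 - x10)                 # HR comparing a 90th- to a 10th-percentile individual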
This question is actually for stats.stackexchange.com though.

Difference between Hmisc wtd.var and SAS PROC MEANS generated weighted variance

I'm getting different results from R and SAS when I try to calculate a weighted variance. Does anyone know what might be causing this difference?
I create vectors of weights and values, and then calculate the weighted variance using the wtd.var function from the Hmisc package:
library(Hmisc)
wt <- c(5, 5, 4, 1)
x <- c(3.7,3.3,3.5,2.8)
wtd.var(x,weights=wt)
I get an answer of:
[1] 0.0612381
But if I try to reproduce these results in SAS I get a quite different result:
data test;
input wt x;
cards;
5 3.7
5 3.3
4 3.5
1 2.8
;
run;
proc means data=test var;
var x;
weight wt;
run;
Results in an answer of
0.2857778
You probably have a difference in how the variance is calculated. SAS gives you an option, VARDEF, which may help here.
proc means data=test var vardef=WDF;
var x;
weight wt;
run;
Run on your dataset, that gives the same variance as R. Both are 'right', depending on how you choose to calculate the weighted variance. (At my shop we calculate it a third way, of course...)
Complete text from PROC MEANS documentation:
VARDEF=divisor specifies the divisor to use in the calculation of the
variance and standard deviation. The following table shows the
possible values for divisor and associated divisors.
Possible Values for VARDEF=
Value          Divisor                    Formula for Divisor
DF             degrees of freedom         n - 1
N              number of observations     n
WDF            sum of weights minus one   (SUM_i w_i) - 1
WEIGHT | WGT   sum of weights             SUM_i w_i
The procedure computes the variance as CSS/Divisor, where CSS
is the corrected sum of squares and equals SUM_i((X_i - Xbar)^2). When you
weight the analysis variables, CSS equals SUM_i(w_i * (X_i - Xwbar)^2), where
Xwbar is the weighted mean.
Default: DF
Requirement: To compute the standard error of the mean,
confidence limits for the mean, or the Student's t-test, use the
default value of VARDEF=.
Tip: When you use the WEIGHT statement and
VARDEF=DF, the variance is an estimate of Sigma^2, where the
variance of the ith observation is Sigma^2/wi and wi is the
weight for the ith observation. This method yields an estimate of the
variance of an observation with unit weight.
Tip: When you use the
WEIGHT statement and VARDEF=WGT, the computed variance is
asymptotically (for large n) an estimate of Sigma^2/wbar, where
wbar is the average weight. This method yields an asymptotic
estimate of the variance of an observation with average weight.
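To see both reported numbers fall out of the CSS/Divisor recipe above, a quick hand computation in R:

wt <- c(5, 5, 4, 1)
x  <- c(3.7, 3.3, 3.5, 2.8)

xwbar <- sum(wt * x) / sum(wt)     # weighted mean
css   <- sum(wt * (x - xwbar)^2)   # corrected sum of squares

css / (sum(wt) - 1)     # WDF divisor: 0.0612381, the Hmisc wtd.var result
css / (length(x) - 1)   # DF divisor (SAS default): 0.2857778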
