Confidence interval and standard error of skewness and kurtosis in R

Please tell me how to calculate skewness and kurtosis along with their associated standard errors and confidence intervals (i.e. the SE of skewness and the SE of kurtosis). I found two packages:
1) package 'measure' can only calculate skewness and kurtosis;
2) package 'rela' can calculate both skewness and kurtosis, but it uses the bootstrap by default and offers no option to turn that off during the calculation.

I'm simply copying and pasting the code published by Howard Seltman here:
# Skewness and kurtosis and their standard errors as implemented by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
spssSkewKurtosis <- function(x) {
  w  <- length(x)
  m1 <- mean(x)
  m2 <- sum((x - m1)^2)
  m3 <- sum((x - m1)^3)
  m4 <- sum((x - m1)^4)
  s1 <- sd(x)
  skew <- w * m3 / (w - 1) / (w - 2) / s1^3
  sdskew <- sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
  kurtosis <- (w * (w + 1) * m4 - 3 * m2^2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * s1^4)
  sdkurtosis <- sqrt(4 * (w^2 - 1) * sdskew^2 / ((w - 3) * (w + 5)))
  mat <- matrix(c(skew, kurtosis, sdskew, sdkurtosis), 2,
                dimnames = list(c("skew", "kurtosis"), c("estimate", "se")))
  return(mat)
}
To get skewness and kurtosis of a variable along with their standard errors, simply run this function:
x <- rnorm(100)
spssSkewKurtosis(x)
##          estimate    se
## skew       -0.684 0.241
## kurtosis    0.273 0.478

The standard errors are valid for normal distributions, but not for other distributions. To see why, you can run the following code (which uses the spssSkewKurtosis function shown above) to estimate the true confidence level of the interval obtained by taking the kurtosis estimate plus or minus 1.96 standard errors:
set.seed(12345)
Nsim <- 10000
Correct <- numeric(Nsim)
for (i in 1:Nsim) {
  Data <- rnorm(1000)
  Kurt <- spssSkewKurtosis(Data)[2, 1]
  seKurt <- spssSkewKurtosis(Data)[2, 2]
  LowerLimit <- Kurt - 1.96 * seKurt
  UpperLimit <- Kurt + 1.96 * seKurt
  Correct[i] <- LowerLimit <= 0 & 0 <= UpperLimit
}
TrueConfLevel = mean(Correct)
TrueConfLevel
This gives you 0.9496, acceptably close to the expected 95%, so the standard errors work as expected when the data come from a normal distribution. But if you change Data = rnorm(1000) to Data = runif(1000), then you are assuming that the data come from a uniform distribution, whose theoretical (excess) kurtosis is -1.2. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit to Correct[i] = LowerLimit <= -1.2 & -1.2 <= UpperLimit gives the result 1.0, meaning that the 95% intervals were always correct, rather than correct for 95% of the samples. Hence, the standard error seems to be overestimated (too large) for the (light-tailed) uniform distribution.
If you change Data = rnorm(1000) to Data = rexp(1000), then you are assuming that the data come from an exponential distribution, whose theoretical (excess) kurtosis is 6.0. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit to Correct[i] = LowerLimit <= 6.0 & 6.0 <= UpperLimit gives the result 0.1007, meaning that the 95% intervals were correct only for 10.07% of the samples, rather than correct for 95% of the samples. Hence, the standard error seems to be underestimated (too small) for the (heavy-tailed) exponential distribution.
Those standard errors are grossly incorrect for non-normal distributions, as the simulation above shows. Thus, the only use of those standard errors is to compare the estimated kurtosis with the expected theoretical normal value (0.0); e.g., using a test of hypothesis. They cannot be used to construct a confidence interval for the true kurtosis.
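For completeness, here is a minimal sketch (not part of the original answer) of that hypothesis-test use, reusing the spssSkewKurtosis() function above to compare the estimated excess kurtosis with the normal-theory value of 0:
x <- rnorm(100)
est <- spssSkewKurtosis(x)
z_kurt <- est["kurtosis", "estimate"] / est["kurtosis", "se"]   # z statistic against H0: excess kurtosis = 0
p_kurt <- 2 * pnorm(-abs(z_kurt))                               # two-sided p-value, valid only under normality
c(z = z_kurt, p = p_kurt)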

@HBat is right: provided your sample data is Gaussian, you can compute the standard error of skewness using the equation from Wikipedia:
n = len(sample)
se_skew = ((6*n*(n-1))/((n-2)*(n+1)*(n+3)))**0.5
However, @BigBendRegion is also right: if your data is not Gaussian, this does not work. Then you may need to bootstrap.
R has the DescTools package that can bootstrap confidence intervals for skew (among other things). It can be called from Python using rpy2 like so:
""" Import rpy2 and the relevant package"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
DescTools = importr('DescTools')
""" You will probably need this if you want to work with numpy arrays"""
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
def compute_skew(data, confidence_level=0.99):
""" Compute the skew and confidence interval using rpy2, DescTools
#param data
#return dict with keys: skew, skew_ci_lower, skew_ci_upper"""
d = {}
d["skew"], d["skew_ci_lower"], d["skew_ci_upper"] = DescTools.Skew(data, conf_level=confidence_level)
return d
""" Call the function on your data (assuming that is saved in a variable named sample)"""
print(compute_skew(sample))
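If you are working in R directly, the same bootstrap confidence interval can be obtained without rpy2. A minimal sketch, assuming the DescTools argument names conf.level, ci.type and R (check ?Skew to confirm):
library(DescTools)
x <- rnorm(100)
Skew(x, conf.level = 0.95, ci.type = "bca", R = 1000)  # estimate with lwr.ci and upr.ci
Kurt(x, conf.level = 0.95, ci.type = "bca", R = 1000)  # same for excess kurtosis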

Try the psych package:
> library(psych)
> a <- data.frame(cola = rep(c('A','B','C'), 100), colb = sample(1:1000, 300), colc = rnorm(300))
> describe(a)
      vars   n   mean     sd median trimmed    mad   min    max  range  skew kurtosis    se
cola*    1 300   2.00   0.82   2.00    2.00   1.48  1.00   3.00   2.00  0.00    -1.51  0.05
colb     2 300 511.76 285.59 506.50  514.21 362.50  1.00 999.00 998.00 -0.04    -1.17 16.49
colc     3 300   0.12   1.04   0.05    0.10   1.07 -2.54   2.91   5.45  0.12    -0.24  0.06
> describe(a)$skew
[1] 0.00000000 -0.04418551 0.11857609

import numpy as np
import pandas as pd

def skew_kurt(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Skewness and kurtosis with their standard errors, column by column,
    using the same SPSS formulas as the R function above."""
    out = []
    for col in dataframe:
        x = dataframe[col]
        sd = x.std()
        if sd == 0:
            out.append({name: np.nan for name in ['skew stat', 'skew se', 'kurt stat', 'kurt se']})
            continue
        w, m1 = len(x), x.mean()
        dif = x - m1
        m2, m3, m4 = tuple((dif ** i).sum() for i in range(2, 5))
        skew = w * m3 / (w - 1) / (w - 2) / sd ** 3
        skew_se = np.sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
        kurt = (w * (w + 1) * m4 - 3 * m2 ** 2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * sd ** 4)
        kurt_se = np.sqrt(4 * (w ** 2 - 1) * skew_se ** 2 / ((w - 3) * (w + 5)))
        out.append({'skew stat': skew, 'skew se': skew_se, 'kurt stat': kurt, 'kurt se': kurt_se})
    result = pd.DataFrame(out, index=list(dataframe))
    # flag estimates that lie within two standard errors of zero
    result['skew<2'] = np.absolute(result['skew stat']) < 2 * result['skew se']
    result['kurt<2'] = np.absolute(result['kurt stat']) < 2 * result['kurt se']
    return result[['skew stat', 'skew se', 'skew<2', 'kurt stat', 'kurt se', 'kurt<2']]

Related

Using R to sample 3 proportion variables so that the three samples add to 1

I have a data set that is split into 3 profiles
Profile 1 = 0.478 (95% confidence interval: 0.4, 0.56)
Profile 2 = 0.415 (95% confidence interval: 0.34, 0.49)
Profile 3 = 0.107 (95% confidence interval: 0.06, 0.15)
Profile 1 + Profile 2 + Profile 3 = 1
I want to create a stochastic model that selects a value for each profile from that proportion's confidence interval, while keeping the constraint that the three values add up to one. I have been using
pro1_prop <- rpert(1, 0.4, 0.478, 0.56)
pro2_prop <- rpert(1, 0.34, 0.415, 0.49)
pro3_prop <- 1 - (pro1_prop + pro2_prop)
But this does not seem robust enough. Also, on some iterations (pro1_prop + pro2_prop) > 1, which results in a negative value for pro3_prop. Is there a better way of doing this? Thank you!
It is straightforward to sample from the posterior distributions of the proportions using Bayesian methods. I'll assume a multinomial model, where each observation is one of the three profiles.
Say the counts data for the three profiles are 76, 66, and 17.
Using a Dirichlet prior distribution, Dir(1/2, 1/2, 1/2), the posterior is also Dirichlet-distributed: Dir(76.5, 66.5, 17.5), which can be sampled using normalized random gamma variates.
x <- c(76, 66, 17) # observations
# take 1M samples of the proportions from the posterior distribution
theta <- matrix(rgamma(3e6, rep(x + 1/2, each = 1e6)), ncol = 3)
theta <- theta/rowSums(theta)
head(theta)
#>           [,1]      [,2]       [,3]
#> [1,] 0.5372362 0.3666786 0.09608526
#> [2,] 0.4008362 0.4365053 0.16265852
#> [3,] 0.5073144 0.3686412 0.12404435
#> [4,] 0.4752601 0.4367119 0.08802793
#> [5,] 0.4428575 0.4520680 0.10507456
#> [6,] 0.4494075 0.4178494 0.13274311
# compare the Bayesian credible intervals with the frequentist confidence intervals
cbind(
t(mapply(function(i) quantile(theta[,i], c(0.025, 0.975)), seq_along(x))),
t(mapply(function(y) setNames(prop.test(y, sum(x))$conf.int, c("2.5%", "97.5%")), x))
)
#>            2.5%     97.5%       2.5%     97.5%
#> [1,] 0.39994839 0.5537903 0.39873573 0.5583192
#> [2,] 0.33939396 0.4910900 0.33840295 0.4959541
#> [3,] 0.06581214 0.1614677 0.06535702 0.1682029
If samples within the individual 95% CIs are needed, simply reject samples that fall outside the desired interval.
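As a sketch of that rejection step, using the theta matrix from the code above and the three reported 95% CIs:
inside <- theta[, 1] >= 0.40 & theta[, 1] <= 0.56 &
  theta[, 2] >= 0.34 & theta[, 2] <= 0.49 &
  theta[, 3] >= 0.06 & theta[, 3] <= 0.15
theta_kept <- theta[inside, ]       # posterior draws that respect all three intervals
nrow(theta_kept) / nrow(theta)      # acceptance rate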
TL;DR: Sample all three values (for example from a PERT distribution, as you did) and normalize them afterwards so they add up to one.
Sampling all three values independently of each other and then dividing by their sum so that the normalized values add up to one seems to be the easiest option, as it is quite hard to sample from the set of legal values directly.
Legal values:
The downside of my approach is that the normalized values are not necessarily legal (i.e. within the confidence intervals) any more. However, for these values and a PERT distribution, this only happens about 0.5% of the time.
Code:
library(plotly)
library(freedom)
library(data.table)
# define lower (L) and upper (U) bounds and expected values (E)
prof1L <- 0.4
prof1E <- 0.478
prof1U <- 0.56
prof2L <- 0.34
prof2E <- 0.415
prof2U <- 0.49
prof3L <- 0.06
prof3E <- 0.107
prof3U <- 0.15
dt <- as.data.table(expand.grid(
  Profile1 = seq(prof1L, prof1U, by = 0.002),
  Profile2 = seq(prof2L, prof2U, by = 0.002),
  Profile3 = seq(prof3L, prof3U, by = 0.002)
))
# color based on how far the points are away from the center
dt[, color := abs(Profile1 - prof1E) + abs(Profile2 - prof2E) + abs(Profile3 - prof3E)]
# only keep those points that (almost) add up to one
dt <- dt[abs(Profile1 + Profile2 + Profile3 - 1) < 0.01]
# plot the legal values
fig <- plot_ly(dt, x = ~Profile1, y = ~Profile2, z = ~Profile3, color = ~color, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers()
fig
# try to simulate the legal values:
# first sample without considering the condition that the profiles need to add up to 1
nSample <- 100000
dtSample <- data.table(
  Profile1Sample = rpert(nSample, prof1L, prof1U, prof1E),
  Profile2Sample = rpert(nSample, prof2L, prof2U, prof2E),
  Profile3Sample = rpert(nSample, prof3L, prof3U, prof3E)
)
# we want to norm the samples by dividing by their sum
dtSample[, SampleSums := Profile1Sample + Profile2Sample + Profile3Sample]
dtSample[, Profile1SampleNormed := Profile1Sample / SampleSums]
dtSample[, Profile2SampleNormed := Profile2Sample / SampleSums]
dtSample[, Profile3SampleNormed := Profile3Sample / SampleSums]
# now get rid of the cases where the normed values are not legal any more
# (e.g. Profile 1 = 0.56, Profile 2 = 0.38, Profile 3 = 0.06 => dividing by their sum
# will make Profile 3 have an illegal value)
dtSample <- dtSample[
  prof1L <= Profile1SampleNormed & Profile1SampleNormed <= prof1U &
    prof2L <= Profile2SampleNormed & Profile2SampleNormed <= prof2U &
    prof3L <= Profile3SampleNormed & Profile3SampleNormed <= prof3U
]
# see if the sampled values follow the desired distribution
hist(dtSample$Profile1SampleNormed)
hist(dtSample$Profile2SampleNormed)
hist(dtSample$Profile3SampleNormed)
[Histogram of normalized sampled values for Profile 1]
OK, some thoughts on the matter.
Let's think about the Dirichlet distribution, since it produces random vectors that sum to 1.
We're talking about Dir(a1, a2, a3) and have to find suitable values of ai.
From the expression for the mean, E[Xi] = ai / (a1 + a2 + a3), it is clear that we can only determine the three ratios by solving the equations
a1 / (a1 + a2 + a3) = 0.478
a2 / (a1 + a2 + a3) = 0.415
a3 / (a1 + a2 + a3) = 0.107
Note that we have only solved for the RATIOS. In other words, if we multiply every ai by the same value, the means stay the same. So we are free to choose a multiplier m, and what changes is the variance/standard deviation: a larger multiplier means a smaller variance, i.e. sampled values that sit more tightly around the means.
So we could try to choose m to satisfy the three 95% CI conditions, but that gives three equations for the variances with only one degree of freedom, so in general it is not possible to match all of them.
One could still play with the numbers and the code, for example as sketched below.
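A minimal sketch of that idea (the multiplier m = 150 is an arbitrary choice of mine; the gamma trick mirrors the earlier answer):
m <- 150                                   # arbitrary multiplier; larger m = tighter samples
alpha <- m * c(0.478, 0.415, 0.107)        # ratios fixed at the stated means
n <- 100000
theta <- matrix(rgamma(3 * n, rep(alpha, each = n)), ncol = 3)
theta <- theta / rowSums(theta)            # Dirichlet(alpha) draws, rows sum to 1
colMeans(theta)                            # close to 0.478, 0.415, 0.107
apply(theta, 2, quantile, probs = c(0.025, 0.975))  # compare the spread with the target 95% CIs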

How to get the same answer from binom.test or prop.test

I'd like to get the same answer from binom.test or prop.test in R for the following question. How can I reproduce my manual calculation (0.009903076)?
n = 475, H0: p = 0.05, H1: p > 0.05
What is the probability that phat > 0.0733?
n <- 475
p0 <- 0.05
p <- 0.0733
(z <- (p - p0)/sqrt(p0*(1 - p0)/n))
# [1] 2.33
(ans <- 1 - pnorm(z))
# [1] 0.009903076
You can get this from prop.test():
prop.test(n*p, n, p0, alternative="greater", correct=FALSE)
# data: n * p out of n, null probability p0
# X-squared = 5.4289, df = 1, p-value = 0.009903
# alternative hypothesis: true p is greater than 0.05
# 95 percent confidence interval:
# 0.05595424 1.00000000
# sample estimates:
# p
# 0.0733
#
You can't get the result from binom.test() so far as I can tell, because n*p is not an integer; it's 34.8175. The binom.test() function only takes an integer-valued number of successes, so when you round this to 35, p effectively becomes 0.07368421, which makes the rest of your results not match. Even if you had a situation where n*p was an integer, binom.test() would still not produce the same answer, because it does not use the normal approximation that your original code does: it uses the exact binomial distribution to calculate the probability above p0.
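For illustration only, here is what the exact test looks like with the rounded count (a sketch; as explained above, it will not reproduce 0.009903):
binom.test(round(n * p), n, p0, alternative = "greater")  # 35 successes, exact binomial tail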

Cointegration analysis in R: How do I get the relevant information from `urca::cajorls`?

Consider cajorls from the urca package in R. It estimates the VEC model from a ca.jo object. How can I find the loading matrix alpha in the output of cajorls? Beta and the other parameters are easy to locate, but I can't find the loading matrix.
The code below is taken from a textbook. Can you help me identify the loading matrix by adding to this piece of code?
library(urca)
set.seed(1234)
n = 250
e1 = rnorm(n, 0, 0.5)
e2 = rnorm(n, 0, 0.5)
e3 = rnorm(n, 0, 0.5)
u1.ar1 = arima.sim(model = list(ar = 0.75), innov = e1, n = n)
u2.ar1 = arima.sim(model = list(ar = 0.3), innov = e2, n = n)
y3 = cumsum(e3)
y1 = 0.8*y3 + u1.ar1
y2 = -0.3*y3 + u2.ar1
y.mat = data.frame(y1,y2,y3)
plot(ts(y.mat))
vecm = ca.jo(y.mat)
jo.results = summary(vecm)
print(jo.results)
# reestimated
vecm.r2 = cajorls(vecm, r = 2)
summary(vecm.r2)
Maybe I should perform the operations on my own?
I ran your script and found this:
print(jo.results)
######################
# Johansen-Procedure #
######################
Test type: maximal eigenvalue statistic (lambda max) , with linear trend
Eigenvalues (lambda):
[1] 0.285347239 0.127915199 0.006887218
Values of teststatistic and critical values of test:
          test 10pct  5pct  1pct
r <= 2 |  1.71  6.50  8.18 11.65
r <= 1 | 33.94 12.91 14.90 19.19
r = 0  | 83.32 18.90 21.07 25.75
Eigenvectors, normalised to first column:
(These are the cointegration relations)
          y1.l2       y2.l2      y3.l2
y1.l2   1.00000  1.00000000  1.0000000
y2.l2 -43.55337 -0.07138149  0.0528435
y3.l2 -13.58606 -0.73018096 -3.4121605
Weights W:
(This is the loading matrix)
             y1.l2       y2.l2        y3.l2
y1.d -0.0007084809 -0.27450042 2.250788e-03
y2.d  0.0174625514  0.03598729 7.150656e-05
y3.d -0.0030589216 -0.02899838 3.086942e-03
Doesn't it say "Weights W: (This is the loading matrix)"?
Or are you looking for something else?
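If you prefer to pull the loadings out of the cajorls() fit itself rather than from summary(vecm), the sketch below reflects my understanding of the object returned by cajorls (a list with $rlm, the restricted VECM estimated by OLS, and $beta); check str(vecm.r2) to confirm the component names:
vecm.r2 <- cajorls(vecm, r = 2)
vecm.r2$beta                               # cointegrating vectors (beta)
cf <- coef(vecm.r2$rlm)                    # rows = regressors, columns = equations
alpha <- t(cf[grep("^ect", rownames(cf)), , drop = FALSE])  # rows named ect1, ect2 are the loadings
alpha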

t-distribution in R

I would like to find the t-value for a 90% confidence interval with 17 observations.
In Excel I can do this calculation with T.INV.2T(.10, 16) = 1.75, but in R I cannot find the correct way to get the same result.
qt(p = 1-.9, df = 17-1) = -1.34
qt(p = (1-.9)/2, df = 17-1) = -1.75 # trying with two-tailed?
Which R function does the same computation as T.INV.2T in Excel?
Similarly, Excel also has T.DIST.2T; what is the equivalent function in R?
You need the 1 - .1 / 2 = 0.95 quantile from the t-distribution with 17 - 1 = 16 degrees of freedom:
qt(0.95, 16)
# [1] 1.745884
Explanation
Excel describes T.INV.2T as
Returns the two-tailed inverse of the Student's t-distribution
which is the quantile in math talk (though I would never use the term 2 tailed quantile). The p% quantile q is defined as the point which satisfies P(X <= q) >= p%.
In R we get that with the function qt (q for quantile, t for t-distribution). Now we just have to sort out what is meant by a two-tailed inverse. It turns out we are looking for the point q that satisfies P(X <= -|q| or X >= |q|) = .1. Since the t-distribution is symmetric, this simplifies to P(X >= |q|) = .1 / 2.
You can easily verify that in R with the use of the probability function pt:
pt(qt(0.05, 16), 16, lower.tail = TRUE) +
pt(qt(0.95, 16), 16, lower.tail = FALSE)
# [1] 0.1
As you correctly guessed, you do it by estimating the two-sided interval (alpha/2 = 0.1/2 = 0.05)
> qt(p = 0.95, df = 16)
[1] 1.745884
So 5% is cut off from each of the two tails. I don't know Excel, but I am guessing that's what that function is doing.
As for T.DIST.2T, I assume that is the two-tailed probability
pt(-1.745884, df=16, lower.tail=T) +
pt(1.745884, df=16, lower.tail=F)
which is equal to 0.09999994.
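If you want drop-in equivalents of the two Excel functions, here is a small sketch (the wrapper names are mine):
t.inv.2t  <- function(p, df) qt(1 - p/2, df)                          # Excel T.INV.2T
t.dist.2t <- function(x, df) 2 * pt(abs(x), df, lower.tail = FALSE)   # Excel T.DIST.2T
t.inv.2t(0.10, 16)        # 1.745884
t.dist.2t(1.745884, 16)   # ~0.10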

Monte Carlo integration in R : getting the wrong answer [using Hit or Miss]

So I'm using the Monte Carlo method to evaluate definite integrals of a bunch of functions.
To start with,
y = x^(-0.5) ; for x in [0.01, 1]
for which my code in R looks like this:
#
s <- NULL
m <- 100
a <- 0.01
b <- 1
set.seed(5)
x <- runif(m, a, b)
y <- runif(m, 0, 1)
for (i in 1:m) {
  if (y[i] < (x[i]^(-0.5))) {
    s[i] <- 1
  } else {
    s[i] <- 0
  }
}
nn <- sum(s == 1)*(b - a)/m
print(nn)
#
Answer (nn) : 0.99
Actual answer: 1.8
I cannot figure out where I'm going wrong with this. Have I done something wrong?
For x in (0, 1), x raised to a negative power is always greater than 1, while y is drawn from [0, 1], so y[i] < x[i]^(-0.5) is always true; you shouldn't be surprised to get a vector of all 1s.
The rectangle you're using is too short (a height of 1). In reality it should be 10 tall, since 0.01^(-0.5) = 10 is the maximum value of the function on [0.01, 1].
Then you take the total area of the rectangle and multiply it by the average of s, so the revised code looks like this:
s <- NULL
m <- 100
a <- 0.01
b <- 1
set.seed(5)
x <- runif(m, a, b)
y <- 10*runif(m, 0, 1)
for (i in 1:m) {
  if (y[i] < (x[i]^(-0.5))) {
    s[i] <- 1
  } else {
    s[i] <- 0
  }
}
nn <- sum(s)*(b - a)/m*10  # note the extra factor of 10, the height of the bounding rectangle
print(nn)
I got a result of 1.683, which is a lot closer to the real answer.
Edit: made a superfluous multiplication, answer revised slightly
As user1362215 points out, your function should be contained in the rectangle. You get closer to the solution if you increase n. Here is a vectorised solution; the results are in the right range.
# Hit and miss
f <- function(x) x ^ (-0.5)
n <- 1000000
a <- 0.01
b <- 1
#ceiling(max(f((seq(0.01,1,by=0.001)))))
#[1] 10
set.seed(5)
x <- runif(n,a,b)
y <- 10*runif(n,0,1)
R <- sum(y < f(x))/n
(b-a)*10*R
#[1] 1.805701
# Repeat a few times to look at the distribution
set.seed(5)
n <- 100000
r <- replicate(1000,sum(10*runif(n,0,1) < f(runif(n,a,b)))/n *(b-a)*10)
hist(r)
summary(r)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   1.755   1.792   1.800   1.800   1.809   1.845
# Sample mean method for comparison
set.seed(5)
r <- replicate(1000, mean(f(runif(n, a,b)))*(b-a))
hist(r)
summary(r)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   1.788   1.798   1.800   1.800   1.803   1.813
Re your edit: I am assuming that by x^2 + y^2, [-1, 1] you mean a circle rather than a function f(z), so the task is really to estimate the area of the unit circle (i.e. Pi) by simulation.
f2 <- function(x) sqrt(1-x^2)
s <- seq(-1 , 1 ,by=0.001)
plot(s,f2(s))
# Get the max value of function within the range
c <- ceiling(max(f2(s)))
# [1] 1
n <- 1000000
a <- -1
b <- 1
set.seed(5)
x <- runif(n,a,b)
y <- c*runif(n,0,1)
R <- sum(y < f2(x))/n
(b-a)*c*R
#[1] 1.57063 # multiply it by 2 to get full area
pi/2
#[1] 1.570796
A Monte Carlo alternative to acceptance/rejection is to uniformly generate x values, average the resulting y = f(x) values to estimate the average height, and multiply that by the interval length to get the estimated area. I don't know R well enough, so here it is in Ruby to illustrate the algorithm:
def f(x)
  x ** -0.5
end

sum = 0.0
10000.times { sum += f(0.01 + 0.99 * rand) }
print (1.0 - 0.01) * (sum / 10000)
I'm getting results in the range 1.8 +/- 0.02
You can also improve the precision of your estimator by using antithetic random variates - for each x you generate, also use the symmetric x value mirrored about the median of the x's.
Using @user20650's code for guidance on how to do this in R, you can estimate Pi/2 as follows:
f <- function(x) sqrt(1-x^2)
n <- 100000
a <- -1
b <- 1
range <- b-a
set.seed(5)
r <- replicate(1000, mean(f(runif(n,a,b))) * range)
hist(r)
summary(r)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   1.566   1.570   1.571   1.571   1.572   1.575
No bounding function is needed for this approach, and generally it yields greater precision than the acceptance/rejection approach.
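As for the antithetic-variates idea mentioned above, here is a minimal sketch for the original integrand x^(-0.5) on [0.01, 1]; each uniform draw u is paired with its mirror image a + b - u, which helps here because the integrand is monotone:
f <- function(x) x^(-0.5)
a <- 0.01
b <- 1
n <- 100000
set.seed(5)
r_plain <- replicate(1000, mean(f(runif(n, a, b))) * (b - a))
r_anti  <- replicate(1000, {
  u <- runif(n/2, a, b)                      # half as many draws ...
  mean((f(u) + f(a + b - u)) / 2) * (b - a)  # ... each paired with its mirror
})
c(sd_plain = sd(r_plain), sd_antithetic = sd(r_anti))  # the antithetic spread should be smaller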
