t-distribution in R

I would like to find the t-value for a 90% confidence interval with 17 observations.
In Excel I can do this calculation with t = T.INV.2T(.10, 16) = 1.75, but in R I cannot find the correct way to get the same result.
qt(p = 1-.9, df = 17-1) = -1.34
qt(p = (1-.9)/2, df = 17-1) = -1.75 # trying with two-tailed?
Which R function does the same computation as T.INV.2T in Excel?
Similarly, Excel also has T.DIST.2T; what is the equivalent function in R?

You need the 1 - .1 / 2 = 0.95 quantile from the t-distribution with 17 - 1 = 16 degrees of freedom:
qt(0.95, 16)
# [1] 1.745884
Explanation
Excel describes T.INV.2T as
Returns the two-tailed inverse of the Student's t-distribution
which is the quantile in math talk (though I would never use the term "two-tailed quantile"). The p% quantile q is defined as the point which satisfies P(X <= q) >= p%.
In R we get that with the function qt (q for quantile, t for the t-distribution). Now we just have to sort out what is meant by a two-tailed inverse. It turns out we are looking for the point q which satisfies P(X <= -|q| or X >= |q|) = .1. Since the t-distribution is symmetric, this simplifies to P(X >= |q|) = .1 / 2.
You can easily verify that in R with the use of the probability function pt:
pt(qt(0.05, 16), 16, lower.tail = TRUE) +
pt(qt(0.95, 16), 16, lower.tail = FALSE)
# [1] 0.1

As you correctly guessed, you do it by taking the quantile for the two-sided interval (alpha/2 = 0.1/2 = 0.05):
> qt(p = 0.95, df = 16)
[1] 1.745884
So 5% is cut off in each of the upper and lower tails. I don't know Excel, but I am guessing that's what that function is doing.
As for T.DIST.2T, I assume that is the two-sided tail probability:
pt(-1.745884, df=16, lower.tail=T) +
pt(1.745884, df=16, lower.tail=F)
which is equal to 0.09999994.
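As far as I know there is no single built-in R function with that name, but the same two-tailed tail probability is a one-liner. A small sketch (t.dist.2t is just a hypothetical helper name):
t.dist.2t <- function(x, df) 2 * pt(abs(x), df = df, lower.tail = FALSE)
t.dist.2t(1.745884, df = 16)
# close to 0.1 (the 0.09999994 computed above)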


Difference drawing random numbers from distributions in R

I am comparing these two ways of drawing random numbers from a beta and a Gaussian distribution. What are their differences? Why are they different?
The first way (_1) simulates from a Uniform(0,1) and then applies the inverse CDF of the Beta (Normal) distribution to those uniform draws to get draws from the Beta (Normal) distribution.
The second way (_2) uses the built-in function to generate random numbers from the distribution.
Beta Distribution
set.seed(1)
beta_1 <- qbeta(runif(1000,0,1), 2, 5)
set.seed(1)
beta_2 <- rbeta(1000, 2,5)
> summary(beta_1); summary(beta_2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.009481 0.164551 0.257283 0.286655 0.387597 0.895144
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.006497 0.158083 0.261649 0.284843 0.396099 0.841760
Here every number is different.
Normal distribution
set.seed(1)
norm_1 <- qnorm(runif(1000, 0,1), 0, 0.1)
set.seed(1)
norm_2 <- rnorm(1000, 0, 0.1)
> summary(norm_1); summary(norm_2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3008048 -0.0649125 -0.0041975 0.0009382 0.0664868 0.3810274
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.300805 -0.069737 -0.003532 -0.001165 0.068843 0.381028
Here the numbers are nearly identical, with only small differences (e.g. in the mean and median).
Shouldn't they all be equal, since I am generating random numbers from distributions with the same parameters?
I think your question boils down to an assumption about the random number generator. If rnorm used the same RNG in the same way as runif does under the hood, then your expectation would hold. It does not: the normal-distribution generator and the uniform generator are configured separately; see ?RNGkind. Without that exact match, you are left with the statistical tests further below.
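On a default R session (version 3.6.0 or later) you can see these separate settings directly; the exact output below assumes the defaults:
RNGkind()
#> [1] "Mersenne-Twister" "Inversion"        "Rejection"
# uniform generator, method for normal variates, method used by sample()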
Is the mean of norm_1 different from the mean of norm_2?
t.test(x = norm_1, y = norm_2)
p-value > 0.05 indicates there is insufficient evidence to reject the null hypothesis that the means are equal at the 0.05 type I error level
Are the distributions different?
ks.test(x = norm_1, y = norm_2)
p-value > 0.05 indicates there is insufficient evidence to reject the null hypothesis that the distributions are equal at the 0.05 type I error level
I tried to sample a Bernoulli variable at home using two different methods.
I flip a coin and assign 1 to heads and 0 to tails.
I roll a six-sided die and assign the result 1 to a roll of the 3 highest numbers and the result 0 to a roll of the 3 lowest numbers.
I did this only twenty times instead of a thousand times, but the principle is the same. I got the following results:
           result 0   result 1
Method 1         11          9
Method 2          8         12
Q: Why did I not get the same result for both methods?
A: Well, it is of course because they are samples and are supposed to vary every time.
Even if I could reset some random seed to remove the variability, it still wouldn't matter, because they are different methods.
Why is there no use of inverse transform sampling?
The normal distribution sampler actually does use inverse transform sampling. Both of the following commands return the same value, 0.3735462:
set.seed(1)
rnorm(1,1,1)
set.seed(1)
qnorm(runif(1),1,1)
rbeta also uses inverse transform sampling here, and the following returns 0.7344913 and 0.2655087, which differ only by the relationship Y = 1 - X (so internally there is some inversion):
alpha = 1
beta = 1
set.seed(1)
rbeta(1,alpha,beta)
set.seed(1)
qbeta(runif(1),alpha,beta)
The beta sampler becomes different when $\alpha$ and $\beta$ are not both equal to one. This is because inverse transform sampling is not very efficient in that case, and the rbeta function switches to an algorithm that creates the sample in a different way. Below is code with the algorithm for the case $\min(\alpha,\beta) \leq 1$.
See for more about the algorithm: Hung, Ying-Chao, Narayanaswamy Balakrishnan, and Yi-Te Lin. "Evaluation of beta generation algorithms." Communications in Statistics-Simulation and Computation 38.4 (2009): 750-770.
You can see a few points that are calculated differently. The algorithm has a few steps where it starts redrawing random numbers, and it does this because redrawing numbers is easier than computing the inverse transform for a difficult case.
alpha = 0.9
beta = 0.9
#### Cheng's BC algorithm
### used if min(alpha,beta)<=1
### initialize
set.seed(1)
p = min(alpha,beta)
q = max(alpha,beta)
a = p+q
b = p^-1
delta = 1+q-p
k1 = delta*(0.0138889+0.0416667*p)/(q*b-0.777778)
k2 = 0.25 + (0.5+0.25/delta)*p
sample = function() {
  ### Perform steps of the algorithm in a loop
  step = 1
  while (step < 6) {
    if (step == 1) {
      U1 = runif(1)
      U2 = runif(1)
      if (U1 < 0.5) {
        step = 2
      } else {
        step = 3
      }
    }
    if (step == 2) {
      Y = U1*U2
      Z = U1*Y
      if (0.25*U2 + Z - Y >= k1) {
        step = 1
      } else {
        step = 5
      }
    }
    if (step == 3) {
      Z = U1^2*U2
      if (Z > 0.25) {
        step = 4
      } else {
        V = b*log(U1/(1-U1))
        W = q*exp(V)
        step = 6
      }
    }
    if (step == 4) {
      if (Z < k2) {
        step = 5
      } else {
        step = 1
      }
    }
    if (step == 5) {
      V = b*log(U1/(1-U1))
      W = q*exp(V)
      if (a*(log(a/(p+W)) + V) - 1.3862944 < log(Z)) {
        step = 1
      } else {
        step = 6
      }
    }
  }
  if (q == alpha) {
    X = W/(p+W)
  } else {
    X = p/(p+W)
  }
  return(X)
}
sample()
n = 20
beta_orig = sapply(1:n, function(x) {
  set.seed(x)
  rbeta(1, alpha, beta)
})
beta_quantile = sapply(1:n, function(x) {
  set.seed(x)
  qbeta(runif(1), alpha, beta)
})
beta_BC = sapply(1:n, function(x) {
  set.seed(x)
  sample()
})
plot(beta_orig,beta_BC, pch = 1, xlim = c(0,1), ylim = c(0,1))
points(beta_orig,beta_quantile, col = 2, pch = 3)
legend(0.3,1, c("rbeta compared to inverse transform sampling", "rbeta compared to manual"), pch=c(3,1), col = c(2,1), cex = 0.85)
Some weird effect
In the code above I was resetting the random seed before each computation. When you generate several numbers in one call, only the first number matches the inverse-transform version.
The following code
set.seed(1)
rnorm(6,1,1)
set.seed(1)
qnorm(runif(6),1,1)
set.seed(2)
rnorm(6,1,1)
set.seed(2)
qnorm(runif(6),1,1)
returns
[1] 0.3735462 1.1836433 0.1643714 2.5952808 1.3295078 0.1795316
[1] 0.3735462 0.6737666 1.1836433 2.3297993 0.1643714 2.2724293
[1] 0.1030855 1.1848492 2.5878453 -0.1303757 0.9197482 1.1324203
[1] 0.10308546 1.53124079 1.18484918 0.03810797 2.58784531 2.58463150
What you see here is that the rnorm function skips every other uniform draw. The reason is that it consumes two uniform random numbers per normal variate to get more precision.
See these lines in the source code of the norm_rand() function that R uses: https://svn.r-project.org/R/trunk/src/nmath/snorm.c
#define BIG 134217728 /* 2^27 */
/* unif_rand() alone is not of high enough precision */
u1 = unif_rand();
u1 = (int)(BIG*u1) + unif_rand();
return qnorm5(u1/BIG, 0.0, 1.0, 1, 0);
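A small sketch that mirrors those lines in R, rebuilding the first rnorm(1, 1, 1) draw by hand from two uniform draws (this assumes the default "Inversion" normal.kind):
BIG <- 134217728                              # 2^27, as in snorm.c
set.seed(1)
u <- runif(2)                                 # the two uniforms rnorm() would consume
z <- qnorm((floor(BIG * u[1]) + u[2]) / BIG)  # high-precision inverse transform
1 + 1 * z                                     # should match rnorm(1, 1, 1) above: 0.3735462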

Using R to sample 3 proportion variables so that the three samples add to 1

I have a data set that is split into 3 profiles
Profile 1 = 0.478 (95% confidence interval: 0.4, 0.56)
Profile 2 = 0.415 (95% confidence interval: 0.34, 0.49)
Profile 3 = 0.107 (95% confidence interval: 0.06, 0.15)
Profile 1 + Profile 2 + Profile 3 = 1
I want to create a stochastic model that selects a value for each profile from that proportion's confidence interval, while keeping the constraint that the three values add up to one. I have been using
pro1_prop<- rpert (1, 0.4, 0.478, 0.56)
pro2_prop<- rpert (1, 0.34, 0.415, 0.49)
pro3_prop<- 1- (pro1_prop + pro2_prop)
But this does not seem robust enough. Also, on some iterations (pro1_prop + pro2_prop) > 1, which results in a negative value for pro3_prop. Is there a better way of doing this? Thank you!
It is straightforward to sample from the posterior distributions of the proportions using Bayesian methods. I'll assume a multinomial model, where each observation is one of the three profiles.
Say the counts data for the three profiles are 76, 66, and 17.
Using a Dirichlet prior distribution, Dir(1/2, 1/2, 1/2), the posterior is also Dirichlet-distributed: Dir(76.5, 66.5, 17.5), which can be sampled using normalized random gamma variates.
x <- c(76, 66, 17) # observations
# take 1M samples of the proportions from the posterior distribution
theta <- matrix(rgamma(3e6, rep(x + 1/2, each = 1e6)), ncol = 3)
theta <- theta/rowSums(theta)
head(theta)
#> [,1] [,2] [,3]
#> [1,] 0.5372362 0.3666786 0.09608526
#> [2,] 0.4008362 0.4365053 0.16265852
#> [3,] 0.5073144 0.3686412 0.12404435
#> [4,] 0.4752601 0.4367119 0.08802793
#> [5,] 0.4428575 0.4520680 0.10507456
#> [6,] 0.4494075 0.4178494 0.13274311
# compare the Bayesian credible intervals with the frequentist confidence intervals
cbind(
t(mapply(function(i) quantile(theta[,i], c(0.025, 0.975)), seq_along(x))),
t(mapply(function(y) setNames(prop.test(y, sum(x))$conf.int, c("2.5%", "97.5%")), x))
)
#> 2.5% 97.5% 2.5% 97.5%
#> [1,] 0.39994839 0.5537903 0.39873573 0.5583192
#> [2,] 0.33939396 0.4910900 0.33840295 0.4959541
#> [3,] 0.06581214 0.1614677 0.06535702 0.1682029
If samples within the individual 95% CIs are needed, simply reject samples that fall outside the desired interval.
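For instance, a sketch of that rejection step, using the theta matrix from the code above and the three CIs quoted in the question as bounds:
keep <- theta[, 1] >= 0.40 & theta[, 1] <= 0.56 &
  theta[, 2] >= 0.34 & theta[, 2] <= 0.49 &
  theta[, 3] >= 0.06 & theta[, 3] <= 0.15
theta_trunc <- theta[keep, ]   # posterior draws truncated to the individual CIs
mean(keep)                     # fraction of draws retained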
TL;DR: Sample all three values (for example from a PERT distribution, as you did) and then normalize them so they add up to one.
Sampling all three values independently of each other and then dividing by their sum so that the normalized values add up to one seems to be the easiest option, as it is quite hard to sample from the set of legal values directly.
Legal values:
The downside of my approach is that the normalized values are not necessarily legal any more (i.e., they may fall outside the confidence intervals). However, for these values and a PERT distribution, this only happens about 0.5% of the time.
Code:
library(plotly)
library(freedom)
library(data.table)
# define lower (L) and upper (U) bounds and expected values (E)
prof1L <- 0.4
prof1E <- 0.478
prof1U <- 0.56
prof2L <- 0.34
prof2E <- 0.415
prof2U <- 0.49
prof3L <- 0.06
prof3E <- 0.107
prof3U <- 0.15
dt <- as.data.table(expand.grid(
Profile1 = seq(prof1L, prof1U, by = 0.002),
Profile2 = seq(prof2L, prof2U, by = 0.002),
Profile3 = seq(prof3L, prof3U, by = 0.002)
))
# color based on how far the points are away from the center
dt[, color := abs(Profile1 - prof1E) + abs(Profile2 - prof2E) + abs(Profile3 - prof3E)]
# only keep those points that (almost) add up to one
dt <- dt[abs(Profile1 + Profile2 + Profile3 - 1) < 0.01]
# plot the legal values
fig <- plot_ly(dt, x = ~Profile1, y = ~Profile2, z = ~Profile3, color = ~color, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers()
fig
# try to simulate the legal values:
# first sample without considering the condition that the profiles need to add up to 1
nSample <- 100000
dtSample <- data.table(
Profile1Sample = rpert(nSample, prof1L, prof1U, prof1E),
Profile2Sample = rpert(nSample, prof2L, prof2U, prof2E),
Profile3Sample = rpert(nSample, prof3L, prof3U, prof3E)
)
# we want to norm the samples by dividing by their sum
dtSample[, SampleSums := Profile1Sample + Profile2Sample + Profile3Sample]
dtSample[, Profile1SampleNormed := Profile1Sample / SampleSums]
dtSample[, Profile2SampleNormed := Profile2Sample / SampleSums]
dtSample[, Profile3SampleNormed := Profile3Sample / SampleSums]
# now get rid of the cases where the normed values are not legal any more
# (e.g. Profile 1 = 0.56, Profile 2 = 0.38, Profile 3 = 0.06 => dividing by their sum
# will make Profile 3 have an illegal value)
dtSample <- dtSample[
prof1L <= Profile1SampleNormed & Profile1SampleNormed <= prof1U &
prof2L <= Profile2SampleNormed & Profile2SampleNormed <= prof2U &
prof3L <= Profile3SampleNormed & Profile3SampleNormed <= prof3U
]
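# check how often the normalization pushed a value outside its CI; per the note
# above, this should be roughly 0.5% of the draws
1 - nrow(dtSample) / nSample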
# see if the sampled values follow the desired distribution
hist(dtSample$Profile1SampleNormed)
hist(dtSample$Profile2SampleNormed)
hist(dtSample$Profile3SampleNormed)
Histogram of normed sampled values for Profile 1:
OK, some thoughts on the matter.
Let's think about the Dirichlet distribution, since it produces random variates that sum to 1.
We are talking about Dir(a1, a2, a3) and have to find the needed ai.
From the expression for the mean, $E[X_i] = a_i / \sum_j a_j$, it is obvious we could get the three ratios by solving the equations
a1/(a1+a2+a3) = 0.478
a2/(a1+a2+a3) = 0.415
a3/(a1+a2+a3) = 0.107
Note that we have only solved for RATIOS. In other words, if we multiply every $a_i$ in $E[X_i] = a_i / \sum_j a_j$ by the same value, the means stay the same. So we have the freedom to choose a multiplier m, and what changes is the variance/std. dev.: a larger multiplier means a smaller variance, i.e. sampled values tighter around the means.
So we would have to choose m to satisfy the three 95% CI conditions at once: three equations for the variances but only one degree of freedom. In general that is not possible.
One could play with the numbers and the code.
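A minimal sketch of that idea, with a purely illustrative multiplier m (not fitted to the stated CIs):
means <- c(0.478, 0.415, 0.107)
m <- 150                                    # illustrative; larger m gives tighter samples
a <- m * means                              # Dirichlet parameters with the required ratios
draws <- matrix(rgamma(3e5, shape = rep(a, each = 1e5)), ncol = 3)
draws <- draws / rowSums(draws)             # Dir(a) samples via normalized gammas
apply(draws, 2, quantile, c(0.025, 0.975))  # compare the spread against the three CIs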

Why am I getting NAs in this calculation in R?

While working on an Rcpp program, I used the sample() function, which gave me the following error: "NAs not allowed in probability." I traced this issue to the fact that the probability vector I used had NA values in it. I have no idea how. Below is some R code that captures the errors:
n.0=20
n.1=20
n.reps=1
beta0.vals=rep(seq(-.3,.1,,n.0),n.reps)
beta1.vals=rep(seq(-7,0,,n.1),n.reps)
beta.grd=as.matrix(expand.grid(beta0.vals,beta1.vals))
n.rnd=200
beta.rnd.grd=cbind(runif(n.rnd,min(beta0.vals),max(beta0.vals)),runif(n.rnd,min(beta1.vals),max(beta1.vals)))
beta.grd=rbind(beta.grd,beta.rnd.grd)
N = 22670
count = 0
for(i in 1:dim(beta.grd)[1]){ # iterate through 600 possible beta values in beta grid
  beta.ind = 0 # indicator for current pair of beta values
  for(j in 1:N){ # iterate through all possible Nsums
    logit = beta.grd[i,1]/N*(j - .1*N)^2 + beta.grd[i,2]
    phi01 = exp(logit)/(1 + exp(logit))
    if(is.na(phi01)){
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
Here, $\beta_0 \in (-0.3, 0.1), \beta_1 \in (-7, 0), N = 22670, N_\text{sum} \in (1, N)$. Note that $N$ and $N_\text{sum}$ are integers, whereas the beta values may not be.
Since mathematically $\phi_{01} \in (0,1)$, I'm assuming the NAs are arising because R is not liking extremely small values. I am receiving an overwhelming number of NA values, too, more so than actual numbers. Why would I be getting NAs in this code?
Add print(logit) next to count = count + 1 and you will find lots of logit values greater than 1000. exp(1000) is Inf, so you divide Inf by Inf, which gives you NaN, and NaN is reported as NA:
> exp(500)
[1] 1.403592e+217
> Inf/Inf
[1] NaN
> is.na(NaN)
[1] TRUE
So your problem is not numbers that are too small but numbers that are too large, coming out of the evaluation of exp(x) with x larger than roughly 710:
> exp(709)
[1] 8.218407e+307
> exp(710)
[1] Inf
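A quick check of that failure mode:
logit <- 1000
exp(logit)                      # Inf
exp(logit) / (1 + exp(logit))   # Inf/Inf -> NaN, which is.na() reports as TRUE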
Bernhard's answer correctly identifies the problem:
If logit is large, exp(logit) = Inf.
Here is a solution:
for(i in 1:dim(beta.grd)[1]){ # iterate through 600 possible beta values in beta grid
  beta.ind = 0 # indicator for current pair of beta values
  for(j in 1:N){ # iterate through all possible Nsums
    logit = beta.grd[i,1]/N*(j - .1*N)^2 + beta.grd[i,2]
    ## This one isn't great because exp(logit) can be very large
    # phi01 = exp(logit)/(1 + exp(logit))
    ## So, we say instead
    ## phi01 = 1 / ( 1 + exp(-logit) )
    phi01 = plogis(logit)
    if(is.na(phi01)){
      count = count + 1
    }
  }
}
cat("Total number of invalid probabilities: ", count)
# Total number of invalid probabilities: 0
We can use the more stable form 1 / (1 + exp(-logit)) (to convince yourself of this, multiply your expression by exp(-logit) / exp(-logit)), and luckily either way R has a built-in function, plogis(), that calculates these probabilities quickly and accurately.
You can see from the help file (?plogis) that this function evaluates the expression I gave, but you can also double check to assure yourself
x = rnorm(1000)
y = 1 / (1 + exp(-x))
z = plogis(x)
all.equal(y, z)
[1] TRUE

The way to get the same answer by binom.test or prop.test

I'd like to get the same answer from binom.test or prop.test in R for the following question. How can I get the same answer as my manual calculation (0.009903076)?
n=475, H0:p=0.05, H1:p>0.05
What is the probability of phat>0.0733?
n <- 475
p0 <- 0.05
p <- 0.0733
(z <- (p - p0)/sqrt(p0*(1 - p0)/n))
# [1] 2.33
(ans <- 1 - pnorm(z))
# [1] 0.009903076
You can get this from prop.test():
prop.test(n*p, n, p0, alternative="greater", correct=FALSE)
# data: n * p out of n, null probability p0
# X-squared = 5.4289, df = 1, p-value = 0.009903
# alternative hypothesis: true p is greater than 0.05
# 95 percent confidence interval:
# 0.05595424 1.00000000
# sample estimates:
# p
# 0.0733
#
You can't get the result from binom.test() so far as I can tell, because n*p is not an integer; it's 34.8175. The binom.test() function only takes an integer number of successes, so when you round this to 35, p effectively becomes 0.07368421, which makes the rest of your results not match. Even if you had a situation where n*p were an integer, binom.test() would still not produce the same answer, because it does not use a normal approximation as your original code does; it uses the binomial distribution to calculate the probability above p0.
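For reference, a sketch of the exact binomial tail probability that binom.test() works with after rounding to 35 successes; it is a different quantity from the normal-approximation p-value above:
binom.test(35, 475, p = 0.05, alternative = "greater")$p.value
# the same tail probability by hand: P(X >= 35) for X ~ Binomial(475, 0.05)
sum(dbinom(35:475, 475, 0.05))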

Confidence Interval and Standard error of Skewness and Kurtosis

Please tell me how to calculate skewness and kurtosis along with their respective standard errors and the confidence intervals associated with them (i.e. the SE of skewness and the SE of kurtosis). I found two packages:
1) package 'measure' can only calculate skewness and kurtosis;
2) package 'rela' can calculate both skewness and kurtosis, but it uses a bootstrap by default and offers no option to turn that off during the calculation.
I'm simply copying and pasting the code published by Howard Seltman here:
# Skewness and kurtosis and their standard errors as implemented by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
spssSkewKurtosis = function(x) {
  w = length(x)
  m1 = mean(x)
  m2 = sum((x-m1)^2)
  m3 = sum((x-m1)^3)
  m4 = sum((x-m1)^4)
  s1 = sd(x)
  skew = w*m3/(w-1)/(w-2)/s1^3
  sdskew = sqrt( 6*w*(w-1) / ((w-2)*(w+1)*(w+3)) )
  kurtosis = (w*(w+1)*m4 - 3*m2^2*(w-1)) / ((w-1)*(w-2)*(w-3)*s1^4)
  sdkurtosis = sqrt( 4*(w^2-1) * sdskew^2 / ((w-3)*(w+5)) )
  mat = matrix(c(skew, kurtosis, sdskew, sdkurtosis), 2,
               dimnames = list(c("skew","kurtosis"), c("estimate","se")))
  return(mat)
}
To get skewness and kurtosis of a variable along with their standard errors, simply run this function:
x <- rnorm(100)
spssSkewKurtosis(x)
## estimate se
## skew -0.684 0.241
## kurtosis 0.273 0.478
The standard errors are valid for normal distributions, but not for other distributions. To see why, you can run the following code (which uses the spssSkewKurtosis function shown above) to estimate the true confidence level of the interval obtained by taking the kurtosis estimate plus or minus 1.96 standard errors:
set.seed(12345)
Nsim = 10000
Correct = numeric(Nsim)
b1.ols = numeric(Nsim)
b1.alt = numeric(Nsim)
for (i in 1:Nsim) {
  Data = rnorm(1000)
  Kurt = spssSkewKurtosis(Data)[2,1]
  seKurt = spssSkewKurtosis(Data)[2,2]
  LowerLimit = Kurt - 1.96*seKurt
  UpperLimit = Kurt + 1.96*seKurt
  Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit
}
TrueConfLevel = mean(Correct)
TrueConfLevel
This gives you 0.9496, acceptably close to the expected 95%, so the standard errors work as expected when the data come from a normal distribution. But if you change Data = rnorm(1000) to Data = runif(1000), then you are assuming that the data come from a uniform distribution, whose theoretical (excess) kurtosis is -1.2. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit to Correct[i] = LowerLimit <= -1.2 & -1.2 <= UpperLimit gives the result 1.0, meaning that the 95% intervals were always correct, rather than correct for 95% of the samples. Hence, the standard error seems to be overestimated (too large) for the (light-tailed) uniform distribution.
If you change Data = rnorm(1000) to Data = rexp(1000), then you are assuming that the data come from an exponential distribution, whose theoretical (excess) kurtosis is 6.0. Making the corresponding change from Correct[i] = LowerLimit <= 0 & 0 <= UpperLimit to Correct[i] = LowerLimit <= 6.0 & 6.0 <= UpperLimit gives the result 0.1007, meaning that the 95% intervals were correct only for 10.07% of the samples, rather than correct for 95% of the samples. Hence, the standard error seems to be underestimated (too small) for the (heavy-tailed) exponential distribution.
Those standard errors are grossly incorrect for non-normal distributions, as the simulation above shows. Thus, the only use of those standard errors is to compare the estimated kurtosis with the expected theoretical normal value (0.0); e.g., using a test of hypothesis. They cannot be used to construct a confidence interval for the true kurtosis.
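For that comparison-with-normal use, a minimal sketch based on the spssSkewKurtosis() function above (the normal reference value is 0 for both skew and excess kurtosis):
x <- rnorm(200)
est <- spssSkewKurtosis(x)
z <- est[, "estimate"] / est[, "se"]   # z statistics for H0: skew = 0, excess kurtosis = 0
2 * pnorm(-abs(z))                     # approximate two-sided p-values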
@HBat is right: provided your sample data is Gaussian, you can compute the standard error using the equation from Wikipedia:
n = len(sample)
se_skew = ((6*n*(n-1))/((n-2)*(n+1)*(n+3)))**0.5
However, @BigBendRegion is also right: if your data is not Gaussian, this does not work. Then you may need to bootstrap.
R has the DescTools package, which can bootstrap confidence intervals for skew (among other things). It can be used from Python via rpy2 like so:
""" Import rpy2 and the relevant package"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
DescTools = importr('DescTools')
""" You will probably need this if you want to work with numpy arrays"""
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
def compute_skew(data, confidence_level=0.99):
    """ Compute the skew and confidence interval using rpy2, DescTools
    #param data
    #return dict with keys: skew, skew_ci_lower, skew_ci_upper"""
    d = {}
    d["skew"], d["skew_ci_lower"], d["skew_ci_upper"] = DescTools.Skew(data, conf_level=confidence_level)
    return d
""" Call the function on your data (assuming that is saved in a variable named sample)"""
print(compute_skew(sample))
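If you are working in R directly, the same DescTools routines can be called without rpy2; a sketch (bootstrap defaults such as ci.type and the number of replicates may differ across DescTools versions):
library(DescTools)
x <- rnorm(100)
Skew(x, conf.level = 0.99)   # skewness estimate with a bootstrap confidence interval
Kurt(x, conf.level = 0.99)   # excess kurtosis with a bootstrap confidence interval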
try package psych:
> a <- data.frame(cola=rep(c('A','B','C'),100),colb=sample(1:1000,300),colc=rnorm(300))
> describe(a)
      vars   n   mean     sd median trimmed    mad   min    max  range  skew kurtosis    se
cola*    1 300   2.00   0.82   2.00    2.00   1.48  1.00   3.00   2.00  0.00    -1.51  0.05
colb     2 300 511.76 285.59 506.50  514.21 362.50  1.00 999.00 998.00 -0.04    -1.17 16.49
colc     3 300   0.12   1.04   0.05    0.10   1.07 -2.54   2.91   5.45  0.12    -0.24  0.06
> describe(a)$skew
[1] 0.00000000 -0.04418551 0.11857609
import numpy as np
import pandas as pd

def skew_kurt(dataframe: pd.DataFrame) -> pd.DataFrame:
    out = []
    for col in dataframe:
        x = dataframe[col]
        sd = x.std()
        if sd == 0:
            out.append({name: np.nan for name in ['skew stat', 'skew se', 'kurt stat', 'kurt se']})
            continue
        w, m1 = len(x), x.mean()
        dif = x - m1
        m2, m3, m4 = tuple([(dif ** i).sum() for i in range(2, 5)])
        skew = w * m3 / (w - 1) / (w - 2) / sd ** 3
        skew_se = np.sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
        kurt = (w * (w + 1) * m4 - 3 * m2 ** 2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * sd ** 4)
        kurt_se = np.sqrt(4 * (w ** 2 - 1) * skew_se ** 2 / ((w - 3) * (w + 5)))
        out.append({'skew stat': skew, 'skew se': skew_se, 'kurt stat': kurt, 'kurt se': kurt_se})
    dataframe = pd.DataFrame(out, index = list(dataframe))
    dataframe['skew<2'] = np.absolute(dataframe['skew stat']) < 2 * dataframe['skew se']
    dataframe['kurt<2'] = np.absolute(dataframe['kurt stat']) < 2 * dataframe['kurt se']
    return dataframe[['skew stat', 'skew se', 'skew<2', 'kurt stat', 'kurt se', 'kurt<2']]
