Skewness of Random Data from rsn Function in sn Package - r

I am following the discussion presented in How to sample from the SN and related distributions, section 2.1. I have generated 100 random samples using the following code:
library(sn)
cp <- c(3,1.2,0.8)
dp <- cp2dp(cp, family="SN")
y <- rsn(100, dp=dp)
According to the description in the paper, this should generate random data with a skewness of 0.8 (gamma). However, when I calculate the skewness using skewness() from the moments library, I get a value of 0.43. I am trying to understand the discrepancy between the two values here. The paper does allude to the difference between gamma and alpha, but does not go into detail.
Fundamentally, my question is: how can I generate random data from a skew-normal (or skew-t) distribution so that the output of skewness() matches the skewness supplied to the random number generator?

I cannot duplicate your problem. Since you did not set a random number seed it is impossible to replicate your exact results, but I get a much better fit. There are three different ways to define skewness. Unfortunately the documentation of the skewness function in package moments does not indicate which one it uses, but e1071 and DescTools both provide skewness functions and let you choose which one you want:
set.seed(42)
cp <- c(3,1.2,0.8)
dp <- cp2dp(cp, family="SN")
y <- rsn(100, dp=dp)
moments::skewness(y)
# [1] 0.8892228
e1071::skewness(y)
# [1] 0.8759179
DescTools::Skew(y)
# [1] 0.8759179
The results are close and comparable to your expected value of .8. Selecting a larger sample will provide a closer fit:
set.seed(123)
y <- rsn(1000, dp=dp)
moments::skewness(y)
# [1] 0.8452521
e1071::skewness(y)
# [1] 0.8439846
DescTools::Skew(y)
# [1] 0.8439846
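The gap between a sample value such as 0.43 and the target 0.8 may simply be sampling variability at n = 100. The following is a minimal base-R sketch of that idea, not the sn package's implementation: it assumes the standard stochastic representation of a skew-normal variate (delta * |U0| + sqrt(1 - delta^2) * U1) and the usual formula linking delta to the skewness gamma1. The helper names r_sn_std and g1 are made up for illustration.

```r
set.seed(1)
gamma1 <- 0.8                              # target population skewness

# Invert gamma1 = (4 - pi)/2 * mu_z^3 / (1 - mu_z^2)^(3/2), with mu_z = delta * sqrt(2/pi)
R     <- (2 * gamma1 / (4 - pi))^(1/3)
mu_z  <- R / sqrt(1 + R^2)
delta <- mu_z / sqrt(2 / pi)

# Hypothetical helper: standard skew-normal draws (skewness does not depend on location/scale)
r_sn_std <- function(n, delta) delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)

# Moment-based sample skewness (divisor n)
g1 <- function(x) { d <- x - mean(x); mean(d^3) / mean(d^2)^1.5 }

# Spread of the sample skewness across many samples of size 100
skews <- replicate(2000, g1(r_sn_std(100, delta)))
quantile(skews, c(0.025, 0.5, 0.975))

# A very large sample should land close to the target
big_skew <- g1(r_sn_std(1e6, delta))
big_skew
```

Comparing the quantiles for n = 100 with the single large-sample value gives a feel for how far an individual small sample can stray from the population skewness.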
The e1071 and DescTools packages default to the third definition of skewness (see the documentation at ?e1071::skewness and ?DescTools::Skew). It appears that the moments package uses the first definition:
e1071::skewness(y, type=1)
# [1] 0.8452521
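For reference, the three definitions can be written out in base R. This is a sketch based on the formulas given in the e1071 documentation (types 1 to 3); the function name skew_types is hypothetical and the input vector is just a toy example.

```r
# Three common sample-skewness estimators, labelled as in e1071's 'type' argument
skew_types <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                 # second central moment (divisor n)
  m3 <- mean(d^3)                 # third central moment (divisor n)
  g1 <- m3 / m2^1.5               # type 1: classical moment estimator
  c(type1 = g1,
    type2 = g1 * sqrt(n * (n - 1)) / (n - 2),   # type 2: bias-adjusted version
    type3 = g1 * ((n - 1) / n)^1.5)             # type 3: uses the (n - 1)-divisor SD
}

x  <- c(1, 2, 3, 10)              # toy data
sk <- skew_types(x)
sk
```

Type 3 differs from type 1 only by the factor ((n - 1)/n)^(3/2), which is why the two sets of numbers above move closer together as the sample size grows.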

Related

Generate multivariate nonnormal random numbers in R

Background
I want to generate multivariate random numbers with a fixed covariance matrix. For example, I want to generate two-dimensional data with covariance 0.5 and variance 1 in each dimension. The first marginal is a normal distribution with mean = 0, sd = 1, and the second is an exponential distribution with rate = 2.
My attempt
My approach is to generate correlated multivariate normal random numbers and then transform them to any target distribution by inverse transform sampling.
Below is an example that transforms two-dimensional normal random numbers into a pair with norm(0,1) and exp(2) marginals:
library(MASS)  # provides mvrnorm()
# generate correlated multivariate normal data; data[,1] and data[,2] are standard normal
data <- mvrnorm(n = 1000,mu = c(0,0), Sigma = matrix(c(1,0.5,0.5,1),2,2))
# calculate the cdf of dimension 2
exp_cdf = ecdf(data[,2])
Fn = exp_cdf(data[,2])
# inverse transform sampling to get an exponential distribution with rate = 2
x = -log(1-Fn + 10^(-5))/2
mean(x);cor(data[,1],x)
Out:
[1] 0.5035326
[1] 0.436236
From the output, the new x looks like a set of exponential(rate = 2) random numbers, and x and data[,1] are correlated at about 0.43. That correlation is not very close to my target of 0.5, which may be an issue: I think the covariance of the generated sample should stay closer to the value I set. In general, I don't think my method is very good; maybe someone has a better code snippet.
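One possible refinement of this approach is sketched below in base R. It rests on two assumptions that are not in the original post: that the target of 0.5 is meant as a Pearson correlation (an Exp(2) marginal has variance 0.25, so covariance 0.5 with unit variances cannot hold literally), and that the theoretical normal CDF is used in place of ecdf(). For a bivariate normal pair, cor(Z1, g(Z2)) = rho * cor(Z2, g(Z2)), so the latent correlation rho can be inflated to compensate for the attenuation caused by the nonlinear transform. The names g, kappa and rho are illustrative.

```r
set.seed(1)
target <- 0.5    # desired Pearson correlation (assumed reading of the question)
rate   <- 2

# Map a standard normal value to Exp(rate): qexp(pnorm(z)), written in a
# numerically stable form as -log(1 - pnorm(z)) / rate
g <- function(z) -pnorm(z, lower.tail = FALSE, log.p = TRUE) / rate

# kappa = cor(Z, g(Z)) for standard normal Z; E[Z * g(Z)] by numerical integration,
# and sd(g(Z)) = 1 / rate for an exponential distribution
ezg   <- integrate(function(z) z * g(z) * dnorm(z), -10, 10)$value
kappa <- ezg / (1 / rate)

# Latent normal correlation needed so that the transformed pair hits the target
rho <- target / kappa

n  <- 1e5
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # correlated standard normals
x1 <- z1                                      # N(0, 1) marginal
x2 <- g(z2)                                   # Exp(2) marginal

c(mean_x2 = mean(x2), cor = cor(x1, x2))
```

The same idea extends to other marginals by swapping g, though not every target correlation is attainable for every pair of marginals.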
My question
As a statistics graduate, I know that in theory there are more than ten methods for generating multivariate random numbers. In this post, I want to collect code snippets that do this automatically, using packages or other handy means. I will then compare them on aspects such as running time and quality of the generated data. Any ideas are appreciated!
Note
Some users think I am asking for a package recommendation. I am not. I already know the commonly used statistical theorems and R packages. I just want to know how to generate multivariate random numbers with a fixed covariance matrix in a sound way, with a code example that generates normal + exponential random numbers. I think there must be more powerful code snippets for doing this, so I am asking for help.
Sources:
generating-correlated-random-variables, math
use copulas to generate multivariate random numbers, stackoverflow
Ross simulation, theoretical book
R CRAN distribution task View

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variables, y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean is the mean of your scores; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
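If the goal is an interval covering the central 95% of the fitted normal distribution (an assumption about what the question means by a "95% confidence interval for this distribution"; this is not a confidence interval for the mean), a base-R sketch could look like this. It recomputes the maximum-likelihood estimates by hand rather than calling fitdistr().

```r
scores <- c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5)
x1 <- scores * 2                          # same doubling as above

# Normal MLEs: sample mean and the divisor-n standard deviation
MEAN <- mean(x1)
SD   <- sqrt(mean((x1 - MEAN)^2))

# Central 95% range of the fitted normal, mapped back to the 0-5 score scale
interval <- qnorm(c(0.025, 0.975), mean = MEAN, sd = SD) / 2
interval
```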

Get degrees of freedom for a Standardized T Distribution with MLE

First of all, I thank you all beforehand for reading this.
I am trying to fit a Standardized T-Student Distribution (i.e. a T-Student with standard deviation = 1) on a series of data; that is: I want to estimate the degrees of freedom via Maximum Likelihood Estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 = fitdist(z1, "t", method = "mle", start = list(df = 10))
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see why the author is defining the t distribution in that different way; there may be some good reason ...
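One reading that is consistent with the question's "standard deviation = 1" is that the spreadsheet density is a t distribution rescaled to unit variance: a t variate with n degrees of freedom has variance n/(n-2), so multiplying it by sqrt((n-2)/n) yields exactly the n-2 terms above. The sketch below builds that density from dt() and fits it to simulated data; the function name dstdt and the simulated sample are illustrative assumptions, not the book's code or the asker's data.

```r
# Density of a t distribution rescaled to unit variance (requires df > 2)
dstdt <- function(x, df, log = FALSE) {
  s <- sqrt((df - 2) / df)                 # scale that makes the variance 1
  if (log) dt(x / s, df, log = TRUE) - log(s) else dt(x / s, df) / s
}

# The spreadsheet's log-density, transcribed with n = df
ll_sheet <- function(x, n) {
  lgamma((n + 1) / 2) - lgamma(n / 2) - log(pi) / 2 - log(n - 2) / 2 -
    (1 + n) / 2 * log(1 + x^2 / (n - 2))
}

x <- seq(-4, 4, by = 0.5)
max_diff <- max(abs(dstdt(x, df = 8, log = TRUE) - ll_sheet(x, 8)))

# Sanity checks: total mass and variance
total_mass <- integrate(function(x) dstdt(x, df = 8), -Inf, Inf)$value
variance   <- integrate(function(x) x^2 * dstdt(x, df = 8), -Inf, Inf)$value

# Maximum-likelihood estimate of df on simulated unit-variance t data
set.seed(1)
z   <- rt(20000, df = 8) * sqrt(6 / 8)
nll <- function(df) -sum(dstdt(z, df, log = TRUE))
fit <- optimize(nll, interval = c(2.1, 50))
fit$minimum
```

Fitting dt() directly, as fitdist(z1, "t") does, maximises a different likelihood, which would explain different estimates for the same data.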
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer agrees with the direct calculation. (It would be very surprising if there were a bug in the dt() function; R's distribution functions are very broadly used and thoroughly tested.)
LL <- function(p,data=z1) {
-sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdistr results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)

Using the sd command in R with a binomially distributed variable

I want to know whether the sd command in R works accurately when calculating the standard deviation of a binomial distribution.
Take the following example:
coin <- c("heads", "heads", "tails", "heads", "tails", "heads", "heads", "tails")
die <- as.factor(coin)
The standard deviation formula for such a distribution would be:
sd <- sqrt(n*p*(1-p))
where n is the number of trials, and p is the probability of success.
So we would calculate it in R as follows:
sqrt(8*(5/8)*(3/8))
[1] 1.369306
However, when we use the sd command, we get a different answer:
sd(die)
[1] 0.5175492
Does the sd function in R not take into consideration the fact that the variable is not numeric? That explanation would make sense to me if R returned an error message, but it produces a result. Can you please clarify this for me? Thanks.
The sd function returns the (corrected) sample standard deviation, not the theoretical SD of a Bernoulli random variable. The sample SD is defined as sqrt(sum((x-x_bar)^2)/(N-1)). See ?sd and ?var. Your example can be checked:
samp_var_die <- sum((as.numeric(die)-mean(as.numeric(die)))^2)/(length(die)-1)
samp_sd_die <- sqrt(samp_var_die)
samp_sd_die
#[1] 0.5175492
If you are interested in exploring the theoretic aspects of statistical distributions, there is a nice suite of packages devoted to this topic. Check out the distr package and its cousins: distrEllipse, distrEx, distrMod, distrRmetrics, distrSim, distrTeach, and RandVar. I found playing with functions and examples from those packages quite educational and entertaining.
By the way, that 1.3+ value you got would be the SD (theoretic sigma) around the estimate of 5 you would have gotten from that series of observations.
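The three quantities in play can be put side by side in a short base-R sketch. Coding heads as 1 and tails as 0 is an assumption made here for clarity; the factor codes 1/2 used above give the same sample SD because a standard deviation is unaffected by shifting the data.

```r
coin <- c("heads", "heads", "tails", "heads", "tails", "heads", "heads", "tails")
x <- as.numeric(coin == "heads")     # 1 = heads, 0 = tails

n <- length(x)
p <- mean(x)                         # estimated probability of heads, 5/8

sample_sd    <- sd(x)                     # what sd() reports: divisor n - 1
bernoulli_sd <- sqrt(p * (1 - p))         # SD of a single trial
count_sd     <- sqrt(n * p * (1 - p))     # SD of the number of heads in n trials

c(sample_sd = sample_sd, bernoulli_sd = bernoulli_sd, count_sd = count_sd)
```

The sample SD equals the single-trial value multiplied by sqrt(n/(n-1)), while the binomial formula describes the count of heads over all n trials, which is a different random variable.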

Density of a Two-Piece Normal (or Split Normal) Distribution

Is there a density function for the two-piece Normal distribution on CRAN? I thought I would check before I code one. I have checked the distribution task view. It is not listed there. I have looked in a couple of likely packages, but to no avail.
Update: I have added dsplitnorm, psplitnorm, qsplitnorm and rsplitnorm functions to the fanplot package.
If you choose to construct your own version of the distribution, you might be interested in distr. It (and the related packages distrEx, distrSim, distrTEst, distrTeach and distrDoc) have been written to provide a unified interface for constructing new distributions from existing ones. (I constructed this example with the help of the wonderful vignette that accompanies the distrDoc package and which can be gotten by typing vignette("distr").)
This implements the split normal distribution, which may not be exactly what you are after. Using the distr toolset, though, it shouldn't be too hard to adjust this to fit your exact needs.
library(distr)
## Construct the distribution object.
## Here, it's a split normal distribution with mode=0, and lower- and
## upper-half standard deviations of 2 and 1, respectively.
splitNorm <- UnivarMixingDistribution(Truncate(Norm(0,2), upper=0),
Truncate(Norm(0,1), lower=0),
mixCoeff=c(0.5, 0.5))
## Construct its density function ...
dsplitNorm <- d(splitNorm)
## ... and a function for sampling random variates from it
rsplitNorm <- r(splitNorm)
## Compare the density it returns to that from dnorm()
dsplitNorm(-1)
# [1] 0.1760327
dnorm(-1, sd=2)
# [1] 0.1760327
## Sample and plot a million random variates from the distribution
x <- rsplitNorm(1e6)
hist(x, breaks=100, col="grey")
## Plot the distribution's continuous density
plot(splitNorm, to.draw.arg="d")
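For comparison, the textbook two-piece normal density, which is continuous at the mode, can also be coded directly in base R. This is a minimal sketch and not the fanplot or distr implementation; the function name dsplitnorm_sketch and its argument names are made up for illustration.

```r
# Two-piece normal density: normal shape with sd1 to the left of the mode and
# sd2 to the right, sharing one normalising constant so the pieces join at the mode
dsplitnorm_sketch <- function(x, mode = 0, sd1 = 1, sd2 = 1) {
  A <- sqrt(2 / pi) / (sd1 + sd2)          # makes the density integrate to 1
  s <- ifelse(x <= mode, sd1, sd2)         # pick the side-specific SD
  A * exp(-(x - mode)^2 / (2 * s^2))
}

f <- function(x) dsplitnorm_sketch(x, mode = 0, sd1 = 2, sd2 = 1)

# Total probability, integrated separately on each side of the mode
total <- integrate(f, -Inf, 0)$value + integrate(f, 0, Inf)$value
total

# With equal SDs the density reduces to the ordinary normal
sym_diff <- max(abs(dsplitnorm_sketch(c(-1, 0, 2), sd1 = 1.5, sd2 = 1.5) -
                    dnorm(c(-1, 0, 2), sd = 1.5)))
```

In this version the left and right halves carry probability sd1/(sd1+sd2) and sd2/(sd1+sd2) rather than 0.5 each, which is the main difference from the mixture constructed above.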
