For an analysis I need to compute an "F-pseudosigma", also called the "pseudo standard deviation". I looked for it in R packages but couldn't find it myself.
There isn't much info on it to begin with.
Does any of you know a package that holds it, or if it is calculated in a function from a package?
I have to admit that I haven't heard about F-pseudo sigma (or pseudo sigma) before; but a bit of research suggests that it is simply defined as the scaled difference between the third and first quartile.
That can be easily translated into a custom R function
fpseudosig <- function(x) unname(diff(quantile(x, c(0.25, 0.75)) / 1.35))
For example, let's generate some random data x ~ N(0, 1)
set.seed(2018)
x <- rnorm(100)
Then
fpseudosig(x)
#[1] 0.9703053
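As a quick sanity check, the 1.35 divisor can be recovered from the standard normal quantiles, and the same quantity can be written with base R's IQR(). This is only a sketch: fpseudosig2 is a hypothetical name, and it uses the unrounded constant rather than 1.35.

```r
# F-pseudosigma as defined above (divisor rounded to 1.35)
fpseudosig <- function(x) unname(diff(quantile(x, c(0.25, 0.75)) / 1.35))

# Width of the central 50% of a standard normal, in standard deviations
k <- diff(qnorm(c(0.25, 0.75)))   # roughly 1.349

# Hypothetical variant using IQR() and the unrounded constant
fpseudosig2 <- function(x, na.rm = FALSE) IQR(x, na.rm = na.rm) / k

set.seed(2018)
x <- rnorm(100)
c(rounded = fpseudosig(x), unrounded = fpseudosig2(x), sd = sd(x))
```

For roughly normal data all three values should be close; with heavy outliers sd(x) will grow while the two quartile-based versions stay put.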
References
(in no particular order)
Irwin, Exploratory Data Analysis for Beginners: "Instead of the using the standard deviation in an RSD calculation, one might consider using the sample-data deviation (F-pseudosigma). This is a nonparametric statistic analogous to the standard deviation that is calculated by using the 25th and 75th percentiles in a data set. It is resistant to the effect of extreme outliers."
https://bqs.usgs.gov/srs/SRS_Spr04/statrate.htm: "The F-pseudosigma is calculated by dividing the fourth-spread (analogous to interquartile range) by 1.349; therefore the smaller the F-pseudosigma the more precise the determinations. The 1.349 value is derived from the number of standard deviations that encompasses 50% of the data."
http://mkseo.pe.kr/stats/?p=5: "Simply put, given the first quartile H1 and the third quartile H3, pseudo sigma is (H3-H1)/1.35. Why? It’s because H1= μ – 0.675σ and H3 = μ + 0.675σ if X ∼N. Therefore, H3-H1=1.35σ, resulting in σ = (H3-H1)/1.35. We call H3-H1 as IQR(Inter Quartile Range)."
I'm wondering how to use replication weights and the confint implementation of the survey package to construct bootstrap confidence intervals/standard errors.
Looking at the survey package's implementation of confint, it seems as though it's simply taking the standard error of the theta list generated after replication, and multiplying it by the statistic corresponding to a given alpha range.
But that doesn't really correspond with any bootstrap implementation I'm aware of. Typically, you'd instead be using a percentile of the distribution of theta to get the sample mean's distribution's confidence interval. The T and BCa intervals are another matter.
Here is my R code. I have not used the provided weights, instead letting the replicate weights be generated as "sample with replacement" weights with equal probability.
library(survey)
library(dplyr)
data(api)
d <- apiclus1 %>% select(fpc, dnum, api99)
dclus1 <- svydesign(id = ~dnum, data = d, fpc = ~fpc)
rclus1 <- as.svrepdesign(dclus1, type = "bootstrap", replicates = 100)
To test the confidence intervals, we can use:
test_mean <- svymean(~api99, rclus1)
confint(test_mean, df=degf(rclus1))
confint(test_mean, df=degf(rclus1)) - mean(d$api99)
Which results in:
         2.5 %   97.5 %
api99 554.2971 659.6592
          2.5 %   97.5 %
api99 -52.68107 52.68107
So clearly the interval is symmetric, which defeats some of the purpose of using the bootstrap.
So let's try this:
test_bs <- withReplicates(rclus1, function(w, data) weighted.mean(data$api99, w), return.replicates=TRUE)
This computes the statistic once per set of replicate weights (which I assume are with-replacement weights). Here are the intervals using BCa intervals on the replications:
bca(test_bs$replicates) - mean(d$api99)
-43.2878 49.1148
Clearly not symmetric.
Using percentile intervals:
c(quantile(test_bs$replicates, 0.025),quantile(test_bs$replicates, 0.975)) - mean(d$api99)
2.5% 97.5%
-45.50944 48.06085
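To make the contrast between the two kinds of interval concrete, here is a base R sketch on made-up replicate values. The vector reps is a hypothetical stand-in for test_bs$replicates, and df = 14 is an assumed value; this only mimics the calculation described above and is not the survey package's own code.

```r
# Hypothetical replicate estimates standing in for test_bs$replicates
set.seed(1)
reps  <- rgamma(100, shape = 20, rate = 20 / 600)  # right-skewed, mean near 600
theta <- mean(reps)                                # stand-in point estimate

# Symmetric interval: point estimate +/- t quantile x spread of the replicates
# (df = 14 is an assumed value, e.g. number of clusters minus one)
ci_sym <- theta + qt(c(0.025, 0.975), df = 14) * sd(reps)

# Percentile interval: read the limits off the replicate distribution
ci_pct <- quantile(reps, c(0.025, 0.975), names = FALSE)

rbind(symmetric = ci_sym - theta, percentile = ci_pct - theta)
```

The first row is symmetric about zero by construction; the second inherits whatever skew the replicates have.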
Valliant implements percentile intervals this way, which should be equivalent to my percentile intervals:
smho.boot.a <- as.svrepdesign(design = smho.dsgn,
                              type = "subbootstrap",
                              replicates = 500)
# total & CI for EOYCNT based on RW bootstrap
a1 <- svytotal(~EOYCNT, design = smho.boot.a,
               na.rm = TRUE,
               return.replicates = TRUE)
# Compute CI based on bootstrap percentile method.
ta1 <- quantile(a1$replicates, c(0.025, 0.975))
I'm looking for clarifications on
a. How to construct bootstrap CIs with survey package for statistics of interest
b. If my withReplicates implementation of the bootstrap is correct
The percentile stuff doesn't actually work with multistage survey data (or, at least, it isn't known to work (or, at least, it isn't known to me to work)). Survey bootstraps just estimate the variance. You don't get the higher order accuracy for smooth functions that way, but you do get asymptotically the right variance. To get asymmetric intervals you need to bootstrap some suitable function of the parameter, as svyciprop and svyquantile (and in a sense svyglm) do.
If you assume simple random sampling with replacement of clusters then you could make the percentile bootstrap and extensions work, but that's not a common structure for real-world surveys (and it doesn't really need the survey package)
I'm sure the maintainer of the package would be happy to implement a bootstrap that gave better asymmetric intervals and worked for multistage samples, if someone pointed out suitable references to him.
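One way to read "bootstrap some suitable function of the parameter" is to use the replicates only to estimate a variance on a transformed scale, build a symmetric interval there, and back-transform. The sketch below does this with a log transform on made-up numbers; reps and est are hypothetical, and this is not how svyciprop or svyquantile are implemented internally.

```r
# Hypothetical replicate estimates and point estimate of a positive quantity
set.seed(42)
reps <- exp(rnorm(200, mean = log(600), sd = 0.05))
est  <- 600

# Replicates are used only for the variance, here on the log scale
se_log <- sd(log(reps))

# Symmetric interval on the log scale, back-transformed to the original scale
ci <- exp(log(est) + qnorm(c(0.025, 0.975)) * se_log)

ci - est   # upper arm is longer than the lower arm
```

The asymmetry here comes from the transformation, not from the shape of the replicate distribution, which is the point made in the answer above.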
I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or areas of statistical significance at the 0.05 level, is assessed by whether or where the y = 0 line crosses the CI, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is outputted for a smooth fit that tests the null hypothesis that the coefficients are all = 0, based on a chi square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release notes for gratia 0.4 explain the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don
One way of getting a p value for the interaction between the by factor variables is to manipulate the difference_smooths() function by activating the ci_level option. Default is 0.95. The ci_level can be manipulated to find a level where the y = 0 is no longer within the CI bands. If for example this occurred when ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. For example, it would take a little manual experimentation and it may be difficult to discern accurately when zero drops out of the CI. Although, a function could be written to search the accompanying data frame that is outputted with difference_smooths() as the ci_level is varied. This is not totally satisfactory either because the detection of a non-zero CI would be dependent on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM using mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
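The manual search over ci_level can be avoided if the bands are pointwise normal intervals of the form diff +/- crit * se, which is an assumption here. In that case zero first drops out at the point where |diff / se| is largest. The sketch below uses a small made-up data frame; the column names diff and se are assumed to match the difference_smooths() output.

```r
# Hypothetical stand-in for the data frame returned by difference_smooths();
# the column names `diff` and `se` are assumptions about that output.
d <- data.frame(diff = c(0.10, 0.40, 0.90, 0.30),
                se   = c(0.50, 0.50, 0.60, 0.40))

# Is zero inside the pointwise interval at every evaluation point?
zero_inside <- function(d, ci_level) {
  crit <- qnorm(1 - (1 - ci_level) / 2)
  all(d$diff - crit * d$se <= 0 & d$diff + crit * d$se >= 0)
}

# Zero first drops out where |diff / se| is largest, so no search is needed
z_max      <- max(abs(d$diff / d$se))
p_approx   <- 2 * pnorm(-z_max)   # "1 - my_level" in the notation above
level_star <- 1 - p_approx

zero_inside(d, level_star + 0.01)  # TRUE: interval still covers zero everywhere
zero_inside(d, level_star - 0.01)  # FALSE: zero has dropped out somewhere
```

Because the intervals are pointwise, taking the most extreme point ignores the fact that many points were examined, so the resulting number is probably better treated as a rough indicator than as a calibrated p value.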
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure if using the criterion, >= 0 (for negative diffs), is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, calculate the percentage, and that is the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values at around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
#run the simulation to get the distribution of each
#difference coefficient using the joint variance
library(MASS)
no.draws = 1000
sim <- mvrnorm(n = no.draws, (coeff.level1 - coeff.level0),
(vp_level1 + vp_level0))
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if all coefficients are not equal to each other then they all can't be equal to zero but that isn't what we want here.
The basis of the smooth tests of fit given by summary(gam.model1) is a joint test of all coefficients == 0. This would be from a type of likelihood ratio test where the model fits with and without a term are compared.
I would appreciate some ideas how to do this with the difference between two smooths.
Now that I got this far, I had a rethink of your original suggestion of using the criterion, >= 0 (for negative diffs). I reinterpreted this as meaning: for each simulated coefficient difference distribution (in my case 8), count when this occurs and make a table where each row (in my case, 8) is for one of these distributions, with two columns holding this count and (number of simulation draws minus count). Then on this table run a chi square test. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>% mutate(interact.var =
ifelse(factor.2levels == "yes", 1, 0)*cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.
@GavinSimpson actually posted a method of how to get the difference between two smooths and assess its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.
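The ordered-factor approach described above can be sketched as follows. The data are simulated purely for illustration, and the names dat, grp and grp_ord are hypothetical; only gam(), s() and summary() from mgcv are used.

```r
library(mgcv)

# Simulated data, purely for illustration
set.seed(1)
n   <- 400
x   <- runif(n)
grp <- factor(sample(c("no", "yes"), n, replace = TRUE), levels = c("no", "yes"))
y   <- sin(2 * pi * x) + ifelse(grp == "yes", 1.5 * x, 0) + rnorm(n, sd = 0.3)
dat <- data.frame(y = y, x = x, grp = grp)

# Ordered version of the by factor; "no" stays the reference level
dat$grp_ord <- factor(dat$grp, levels = c("no", "yes"), ordered = TRUE)

# s(x) is the smooth for the reference level; s(x, by = grp_ord) is the
# difference smooth for "yes" relative to "no"
m <- gam(y ~ grp_ord + s(x) + s(x, by = grp_ord), data = dat, method = "REML")

# The second row (the by smooth) tests whether the difference smooth is zero
st <- summary(m)$s.table
st
```

In this simulated case the two groups really do differ in shape, so the second row should show a small p value; with real data the same row is what the approach above reads off.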
I'm new to R and trying to learn stats..
Here is one practice question that I'm trying to figure out
How should I use R code to create a function based on this math equation?
I have a dataframe like this
The "exposed" column of the data frame contains two groups: one is called "Test Group (Exposed)" and the other "Control Group". The math function refers to these two groups.
In another practice I have these codes here to calculate the confidence interval
# sample size
# OK for non normal data if n > 30
n <- 150
# calculate the mean & standard deviation
will_mean <- mean(will_sample)
will_s <- sd(will_sample)
# normal quantile function, assuming mean has a normal distribution:
qnorm(p=0.975, mean=0, sd=1) # 97.5th percentile for a N(0,1) distribution
# a.k.a. Z = 1.96 from the standard normal distribution
# calculate the margin of error
# confidence interval = mean +/- critical value x (s/sqrt(n)), where s/sqrt(n) is the standard error of the mean
# "q" functions in R give the value of the statistic at a given quantile
critical_value <- qt(p=0.975, df=n-1)
error <- critical_value * will_s/sqrt(n)
# confidence interval
will_mean - error
will_mean + error
but I'm not sure how to do this for the two "exposed" groups.
Don't worry, it's quite easy: if you have experience in at least one programming language, R is fairly straightforward.
The only remarkable difference between R and most other programming languages is that R was developed for statistical purposes.
You can compute the quantile for a given significance level α (remember to divide it by 2 for your formula) with the function qnorm(). By default it uses the standard normal distribution, as in your case; for more details, see the documentation with ?qnorm.
In the exercise you are not actually required to compute it, since it is passed as an argument, but in practice you would need to.
The code should be something like:
conf <- function(p1, p2, n1, n2, z) {
  # margin of error: z times the standard error of p1 - p2
  part <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(p1 - p2 - part, p1 - p2 + part)
}
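A usage sketch, assuming the exercise is about the difference between two proportions (which is what the function above computes). The data frame dat and its 0/1 outcome column converted are made up for illustration; only the exposed column and its two group labels come from the question.

```r
conf <- function(p1, p2, n1, n2, z) {
  part <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(lower = p1 - p2 - part, upper = p1 - p2 + part)
}

# Hypothetical data: `converted` is an assumed 0/1 outcome column
dat <- data.frame(
  exposed   = rep(c("Test Group (Exposed)", "Control Group"), each = 100),
  converted = c(rep(c(1, 0), times = c(50, 50)), rep(c(1, 0), times = c(40, 60)))
)

p <- tapply(dat$converted, dat$exposed, mean)    # proportion per group
n <- tapply(dat$converted, dat$exposed, length)  # sample size per group
z <- qnorm(1 - 0.05 / 2)                         # two-sided 95% critical value

ci <- conf(p[["Test Group (Exposed)"]], p[["Control Group"]],
           n[["Test Group (Exposed)"]], n[["Control Group"]], z)
ci
```

Indexing p and n by group name rather than by position avoids depending on the alphabetical order in which tapply() returns the groups.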
I'm using R to make some calculations. This question is about R but also about statistics.
Say I have a dataset of paired samples consisting of a subject's blood platelet concentration after injection of placebo and then again after injection of medication for a number of subjects. I want to estimate the mean difference for the paired samples. I'm just learning about the t distribution. If I wanted a 95% confidence interval for the mean difference using a Z-test, I could simply use:
mydata$diff <- mydata$medication - mydata$placebo
mu0 <- mean(mydata$diff)
sdmu <- sd(mydata$diff) / sqrt(length(mydata$diff))
qnorm(c(0.025, 0.975), mu0, sdmu)
After much confusion and cross-checking with the t.test function, I've figured out that I can get the 95% confidence interval for a t-test with:
qt(c(0.025, 0.975), df=19) * sdmu + mu0
My understanding of this is as follows:
Tstatistic = (mu - mu0)/sdmu
Tcdf^-1(0.025) <= (mu - mu0) / sdmu <= Tcdf^-1(0.975)
=>
sdmu * Tcdf^-1(0.025) + mu0 <= mu <= sdmu * Tcdf^-1(0.975) + mu0
The reason this is confusing is that if I were using a Z-test, I would write it like this:
qnorm(c(0.025, 0.975), mu0, sdmu)
and it's not until I tried to figure out how to use the t distribution that I realised I could move the normal distribution parameters out of the function too:
qnorm(c(0.025, 0.975), 0, 1) * sdmu + mu0
Trying to wrap my head around what this means mathematically, it means that the Z-statistic (mu - mu0)/sdmu is always normally distributed with mean 0 and standard deviation of 1?
What has me stumped is that I'd like to move the t distribution parameters into the arguments to the function to cut down on the enormous mental overhead of thinking about this transformation.
However, according to my version of the R function qt's documentation, in order to do this, I would need to calculate the non-centrality parameter ncp. According to (my version of) the documentation, the ncp is explained as follows:
Let T= (mX - m0) / (S/sqrt(n)) where mX is the mean and S the sample standard deviation (sd) of X_1, X_2, …, X_n which are i.i.d. N(μ, σ^2) Then T is distributed as non-central t with df= n - 1 degrees of freedom and non-centrality parameter ncp = (μ - m0) * sqrt(n)/σ.
I can't wrap my head around this at all. At first it seems to fit into my framework because Tstatistic = (mu - m0) / sdmu. But isn't μ what I want the qt function (which is Tcdf^-1) to return? How can it appear in the ncp, which I need to give as an input? And what about σ? What do μ and σ mean in this context?
Basically, how can I get the same result as qt(c(0.025, 0.975), df=19) * sdmu + mu0, without any terms outside of the function call, and could I have an explanation of how it works?
Let me try to explain without using any formulae.
First of all, the student t distribution and the normal distribution are two distinct probability distributions and (in most situations) are not supposed to give you the same results.
The t distribution is the appropriate probability distribution to test for a difference between two normally distributed samples. Since we do not know the population sd we have to stick with the one we get from the sample. And that distribution is not normal distributed anymore, it is t-distributed.
The z-distribution can be used to approximate the test. In this case, we use the z-distribution as an approximation of the t-distribution. However, it is recommended not to do this with low degrees of freedom. Reason: the more degrees of freedom a t distribution has, the more similar it becomes to a normal distribution. Textbooks usually say that with df > 30 the t distribution is similar enough to the normal distribution to be approximated by it. In order to do that, you would have to normalise your data first, so that mean = 0 and sd = 1. Then you can do the approximation using the z-distribution.
I usually recommend not to use this approximation. It was a reasonable crutch when calculations had to be done on paper using your head, a pen, and a bunch of tables. There exist many workarounds in basic statistics that were supposed to give you a reasonable result with less computational effort. With modern computers that is usually obsolete (in most cases at least).
The z distribution, by the way, is defined (by convention) as a normal distribution N(0, 1) i.e. a normal distribution with mean = 0 and sd = 1.
Finally, about the different ways these distributions are specified. The normal distribution is actually the only probability distribution that I know that you can specify by setting mean and sd directly (there are dozens of distributions, in case you're interested). The non-centrality parameter has a similar effect to the mean of the normal distribution. In a plot it moves the t-distribution along the x-axis. But it also changes its shape and skews it so that mean and ncp move away from each other.
This code will show how the ncp changes the shape and location of the t-distribution:
x <- seq(-5, 15, 0.1)
# `from` and `to` are not arguments of plot() for x/y data, so they are dropped
plot(x, dt(x, df = 10, ncp = 0), type = "l", ylab = "density")
for (ncp in 1:6)
  lines(x, dt(x, df = 10, ncp = ncp))
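To connect this back to the question: what is being asked for is a shifted and scaled t distribution, which is a different thing from a non-central one. A small wrapper can hide the transformation. The sketch below uses simulated paired data, and qt_ls is a hypothetical helper name, not a function from base R.

```r
# Simulated paired data, purely for illustration (20 subjects, so df = 19)
set.seed(1)
placebo    <- rnorm(20, mean = 250, sd = 30)
medication <- placebo + rnorm(20, mean = 10, sd = 15)

d    <- medication - placebo
mu0  <- mean(d)
sdmu <- sd(d) / sqrt(length(d))

# Hypothetical helper: quantiles of a t distribution shifted by `mean`
# and stretched by `scale` (this is not what `ncp` does)
qt_ls <- function(p, df, mean = 0, scale = 1) mean + scale * qt(p, df)

ci_manual <- qt(c(0.025, 0.975), df = 19) * sdmu + mu0
ci_helper <- qt_ls(c(0.025, 0.975), df = 19, mean = mu0, scale = sdmu)
ci_ttest  <- t.test(medication, placebo, paired = TRUE)$conf.int

# The same shift-and-scale identity holds for the normal distribution
z_inside  <- qnorm(c(0.025, 0.975), mu0, sdmu)
z_outside <- qnorm(c(0.025, 0.975)) * sdmu + mu0
```

All three t-based intervals agree, and so do the two normal versions; qnorm() simply has the shift and scale built in as its mean and sd arguments, whereas qt() does not.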
Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variables, y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta binomial. Hence I fix the probability so that the mean is the mean of your scores, and looking at the distribution above it seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal approximation, and in this case (on the doubled scale of x1) it will be:
normal_fit$estimate
    mean       sd
6.560000 1.134196
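If the goal is the range in which about 95% of scores would fall under the fitted normal (an assumption about what "95% confidence interval for this distribution" means, as opposed to an interval for the mean), it can be read off with qnorm() and converted back to the original 0-5 scale. This sketch uses base R only and reproduces the maximum-likelihood estimates by hand.

```r
scores <- c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5)
x1 <- scores * 2

# Maximum-likelihood normal fit (divisor n rather than n - 1), doubled scale
m <- mean(x1)
s <- sqrt(mean((x1 - m)^2))

# Central 95% range of the fitted normal, converted back to the 0-5 scale
range95 <- qnorm(c(0.025, 0.975), mean = m, sd = s) / 2
range95
```

Since the scores only take values in steps of 0.5, the limits are best rounded to the nearest attainable score when reporting them.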