Choice of statistical test (in R) for two apparently different distributions

I have the following lists of data, each with 10 samples.
The values indicate the binding strength of a particular molecule.
What I want to show is that 'x' is statistically different from
'y', 'z' and 'w'. Visually it appears to be: 'x' has more
values well above zero (2.8, 1.00, 5.4, etc.) than the others.
I tried a t-test, but every comparison gives an insignificant
difference with a high p-value.
What is the appropriate test for this?
Below is my code:
#!/usr/bin/Rscript
x <-c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
t.test(x,y)
t.test(x,z)

You have not specified in what way you expect the samples to differ. Typically one means that the mean differs across samples; in that case, the t-test is appropriate. While x has some high values, it also has some low values that pull its mean back in. It seems that what you thought was a significant difference (visually) is actually just a larger variance.
If your question is about variance, then you need an F-test.
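For reference, a minimal sketch of that variance comparison, using the vectors defined in the question (note that var.test() assumes approximate normality, which may not hold for this data):
# Hedged sketch: F-test comparing the variances of x and y.
var.test(x, y)
# Repeat for x vs z and x vs w if needed.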

The classic test for this type of data is analysis of variance (ANOVA). ANOVA tells you whether the means of all four categories are likely the same (failure to reject the null hypothesis) or whether at least one mean likely differs from the others (rejection of the null hypothesis).
If the anova is significant, you will often want to perform the Tukey HSD post-hoc test to figure out which category differs from the others. Tukey HSD yields p-values that are already adjusted for multiple comparisons.
library(ggplot2)
library(reshape2)
x <- c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,
0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,
0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,
0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,
0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
dat = data.frame(x, y, z, w)
mdat = melt(dat)
anova_results = aov(value ~ variable, data=mdat)
summary(anova_results)
# Df Sum Sq Mean Sq F value Pr(>F)
# variable 3 5.83 1.9431 2.134 0.113
# Residuals 36 32.78 0.9105
The ANOVA p-value is 0.113, and the Tukey test p-values for your "x" category are in a similar range. This is the quantification of your intuition that "x" is different from the others. Most researchers would find p = 0.11 suggestive but still carrying too high a risk of being a false positive. Note that the large difference in means (the diff column) along with the boxplot figure below might be more persuasive than the p-value.
TukeyHSD(anova_results)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = value ~ variable, data = mdat)
#
# $variable
# diff lwr upr p adj
# y-x -0.92673335 -2.076048 0.2225815 0.1506271
# z-x -0.82314118 -1.972456 0.3261737 0.2342515
# w-x -0.88266565 -2.031981 0.2666492 0.1828672
# z-y 0.10359217 -1.045723 1.2529071 0.9948795
# w-y 0.04406770 -1.105247 1.1933826 0.9995981
# w-z -0.05952447 -1.208839 1.0897904 0.9990129
plot_1 = ggplot(mdat, aes(x=variable, y=value, colour=variable)) +
geom_boxplot() +
geom_point(size=5, shape=1)
ggsave("plot_1.png", plot_1, height=3.5, width=7, units="in")

In your question you referred to the distributions being different because some of them had more values greater than 0. If you define the distributions by the number of values greater than 0, then you are dealing with the binomial distribution (after converting the values to 1's and 0's), and a function you could then use is prop.test().
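A minimal sketch of that idea, using the vectors from the question. The cutoff of 1 below is purely illustrative (the question only says "greater than zero", so the threshold is an assumption you would need to justify):
# Hedged sketch: compare the proportion of values above an illustrative threshold in x vs y.
threshold <- 1   # assumption: not specified in the question
successes <- c(sum(x > threshold), sum(y > threshold))
trials <- c(length(x), length(y))
prop.test(successes, trials)   # with only 10 samples per group, expect a warning about the chi-squared approximation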

Related

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x, y, z) measured in 3 replicates. x and z are the independent variables, y is the dependent variable, and t is the time variable. All three variables follow diel variation: they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0, 4*pi, , 100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for (i in 1:3) {
  y <- 3*sin(b*t) + rnorm(n)*2
  x <- 2*sin(b*t + 2.5) + rnorm(n)*2
  z <- 4*sin(b*t - 2.5) + rnorm(n)*2
  data = as_tibble(cbind(y, x, z)) %>% mutate(t = 1:100) %>% mutate(replicate = i)
  datalist[[i]] <- data
}
df <- do.call(rbind, datalist)
ggplot(df) +
  geom_line(aes(t, x), color = 'red') + geom_line(aes(t, y), color = 'blue') +
  geom_line(aes(t, z), color = 'green') + facet_wrap(~replicate, nrow = 1) + theme_bw()
I can identify the lead/lag of y with respect to x and z individually using the ccf() function in R. For example:
ccf(x,y)
ccf(z,y)
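As an aside, a minimal sketch (assuming the x and y from the simulation above) of pulling the peak lag out of the ccf object:
# Sketch: lag at which the absolute cross-correlation between x and y is largest.
cc <- ccf(x, y, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]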
But I would like to do it in a multivariate regression approach. For example, the lme() function from the nlme package suggests that x and z negatively affect y:
lme <- lme(y ~ x + z, data = df, random = ~ 1 | replicate, correlation = corCAR1(form = ~ t | replicate))
It is impossible (in the actual data) that x and z negatively affect y.
I need the time lead/lag, and I would also like the standardized coefficient (t-value, to compare effect sizes), both from the same model.
Is there a multivariate model that can give me the lead/lag and also the regression coefficients?
We might be considering the statistical significance of a Cramér-Rao lower-bound estimate. To find Xbeta - Xinfinity, taking the expectation of Xbeta with an assumed mean mu yields a quantity, mu^2, which can replace Xinfinity. Using the F-test likelihood ratio, the degrees of freedom are p2 - p1 = n - p2.
Put it this way: the estimates are n = -2*mu^2/(mu^2 + n), phi_t = y/Xbeta, and Xbeta = (y - beta0)/a.
The point estimate, Xbeta, is derived from y = a*Xbeta + b. The time lead/lag is phi_t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t = beta.
Spectral analysis of the linear distribution indicates a point estimate beta0 = 0.27, which is a significant peak of variability. Scaling Xbeta by beta0 would be an appropriate idea.

GLM LM predicted values sum and High Dispersion

Any reason why the sum of the predicted values and the sum of the dependent variable are the same?
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl*100, trt*20)
lm.D9 <- glm(weight ~ group,family = gaussian())
summary(lm.D9)
y <- predict(lm.D9, newdata = data.frame(group = group), type = "response")  # newdata must be a data frame
sum(weight)
sum(y)
Also, the dispersion is very high (in my actual data). Any leads on how to tackle this? My original data involve building a model of actual vs expected. I have tried 2 different models:
Ratio of actual to expected as the dependent variable, with a Gaussian GLM
Actual minus expected difference as the dependent variable.
But the dispersion in the second case is very high, and neither model validates.
Help appreciated!
You have two groups; when you perform a linear regression, the predicted value is the mean of each group:
predict(lm.D9,newdata=data.frame(group=c("Ctl","Trt")))
1 2
503.20 93.22
You can check this:
tapply(weight,group,mean)
Ctl Trt
503.20 93.22
And if you sum up the predicted values, it is essentially the number of observations in each group times that group's mean, which gives you back the sum of the values you started with.
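A quick sketch of that identity, using the objects defined in the question:
# Sketch: group means times group sizes reproduce the total of `weight`.
means <- tapply(weight, group, mean)
counts <- table(group)
sum(means * counts)   # equals sum(weight)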
We can check how the data look, and to me they look fine, with no crazy outliers:
boxplot(weight ~ group)
You can check out this post; the dispersion in a Gaussian glm is the sum of squared residuals divided by the residual degrees of freedom, basically the square of the deviation from your predicted values:
sum(residuals(lm.D9)^2)/lm.D9$df.residual
[1] 1825.962
Given that the mean of your data is 298.21, an average deviation of sqrt(1825.962) = 42.73 is fairly reasonable.
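As a cross-check (a small sketch, not part of the original output), the same quantity is what summary() reports as the dispersion parameter of the Gaussian GLM:
summary(lm.D9)$dispersion   # matches sum(residuals(lm.D9)^2)/lm.D9$df.residual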

Why does the distribution of p-values peak at around zero when running a 1000x bootstrapped t-test on my data?

I have some numerical measurements on two groups of people and I would like to compare the means of these two groups. I'm just using a t-test for that purpose, which gives me a confidence interval and a p-value. Now I'd like to do a bootstrap analysis on these data to get a feel for the variability of both the CI and the p-values.
I'm using R and the boot package. The data is stored in dataframe data. To calculate the statistics I have this function:
calculate <- function(formula, data, indices) {
  d <- data[indices, ]
  m <- t.test(formula, data = d)
  return(c(m$conf.int, m$p.value))
}
Then I run the bootstrap as follows:
results <- boot(data=data, statistic=calculate, R=1000, formula=y ~ x)
Then I plot the p-values in "results" as follows:
hist(results$t[,3], breaks=32)
The histogram looks as shown below. I understand that the distribution of p-values is skewed because the p-value is constrained to be no smaller than zero. But I don't understand why the peak of the distribution is at zero as well, no matter how many breaks I display in the histogram.
Any insight would be greatly appreciated!
We can try with an example dataset:
library(boot)
set.seed(111)
data = data.frame(y = rnorm(60,rep(0:1,each=30),1),x=rep(1:2,each=30))
results <- boot(data=data, statistic=calculate, R=1000, formula=y ~ x)
The original observation itself is significant:
t.test(y~x,data=data)
Welch Two Sample t-test
data: y by x
t = -4.0621, df = 55.339, p-value = 0.0001547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.8262396 -0.6197054
sample estimates:
mean in group 1 mean in group 2
-0.1881403 1.0348322
And even if you resample, most of the bootstrap replicates will show a significant value, but we can see how this varies on the -log10 scale, with the actual observed value in blue:
hist(-log10(results$t[,3]),br=20,xlab="Bootstrap -log 10 p values")
abline(v= -log10(results$t0[3]),lty=8,col="blue")
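For contrast, a small sketch (not from the original answer) of what the histogram looks like when the null is actually true, i.e. both groups share the same mean; the p-values then spread roughly uniformly over [0, 1] instead of piling up at zero:
# Sketch: p-values from repeated t-tests on two groups with identical means.
set.seed(123)
p_null <- replicate(1000, t.test(rnorm(30), rnorm(30))$p.value)
hist(p_null, breaks = 32, xlab = "p-value under the null")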

Generating similar estimates of interactions in afex, lsmeans, and lme4 packages

I would like to know if there is a way to get the same estimate of an interaction effect in the afex and lsmeans packages as in lmer. The toy data below are for two groups with different intercepts and slopes.
set.seed(1234)
A0 <- rnorm(4,2,1)
B0 <- rnorm(4,2+3,1)
A1 <- rnorm(4,6,1)
B1 <- rnorm(4,6+2,1)
A2 <- rnorm(4,10,1)
B2 <- rnorm(4,10+1,1)
A3 <- rnorm(4,14,1)
B3 <- rnorm(4,14+0,1)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- factor(rep(1:8,times = 4, length = 32))
time <- factor(rep(0:3, each = 8, length = 32))
timeNum <- as.numeric(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id, group, time, timeNum, score)
df
And here is the plot
(ggplot(df, aes(x = time, y = score, group = group)) +
stat_summary(fun.y = "mean", geom = "line", aes(linetype = group)) +
stat_summary(fun.y = "mean", geom = "point", aes(shape = group), size = 3) +
coord_cartesian(ylim = c(0,18)))
When I run a standard lmer on the data, looking for an estimate of the difference between groups in the change in score over time:
library(lme4) # needed for lmer()
summary(modelLMER <- lmer(score ~ group * timeNum + (timeNum | id), df))
I get an estimate for the group*time interaction of -1.07, which means that the increase in score for a one-unit increase in time is ~1 point less in group B than group A. This estimate matches the preset differences I built into the dataset.
What I would like to know is how to do a similar thing in the afex and lsmeans packages.
library(afex)
library(lsmeans)
First I generated the afex model object
modelLM <- aov_ez(id="id", dv="score", data=df, between="group", within="time",
type=3, return="lm")
Then passed that into the lsmeans function
lsMeansLM <- lsmeans(modelLM, ~rep.meas:group)
My goal is to generate an accurate estimate of the group*time interaction in afex and lsmeans. To do so requires specifying custom contrast matrices based on the split specified in the lsmeans function above.
groupMain = list(c(-1,-1,-1,-1,1,1,1,1)) # group main effect
linTrend = list(c(-3,-1,1,3,-3,-1,1,3)) # linear trend
linXGroup = mapply("*", groupMain, linTrend) # group x linear trend interaction
Then I made a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
Which I passed into the contrast function in lsmeans.
contrast(lsMeansLM, contrasts)
The F and p values in the output match those for the automatic tests for linear trend and for the group difference in linear trend generated from a mixed ANCOVA in SPSS. However the mixed ANCOVA does not generate an estimate.
The estimate of the effect using the procedure above, instead of being approximately -1 as in the lmer (and matching the difference I built into the data), is approximately -10, which is wildly inaccurate.
I assume it has something to do with how I am coding the contrast coefficients. I know that if I normalise the coefficients of the groupMain matrix by dividing all of them by four, that yields an accurate estimate of the main effect of group averaged across all timepoints. But I have no idea how to get an accurate estimate of either the linear trend averaged across groups (linTrend) or the difference in linear trend across groups (linXGroup).
I am not sure if this question is more suitable for here or Cross Validated. I figured here first because it seems to be software related, but I know there are probably deeper issues involved. Any help would be much appreciated.
The issue here is that timeNum is a numeric predictor. Therefore, the interaction is a comparison of slopes. Note this:
> lstrends(modelLMER, ~group, var = "timeNum")
group timeNum.trend SE df lower.CL upper.CL
A 4.047168 0.229166 6.2 3.490738 4.603598
B 2.977761 0.229166 6.2 2.421331 3.534191
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
> pairs(.Last.value)
contrast estimate SE df t.ratio p.value
A - B 1.069407 0.3240897 6.2 3.3 0.0157
There's your 1.07 - the opposite sign because the comparison is in the other direction.
I will further explain that the lsmeans result you describe in the question is a comparison of the two group means, not an interaction contrast. lsmeans uses a reference grid:
> ref.grid(modelLMER)
'ref.grid' object with variables:
group = A, B
timeNum = 1.5
and as you can see, timeNum is being held fixed at its mean of 1.5. The LS means are predictions for each group at timeNum = 1.5 -- often called the adjusted means; and the difference is thus the difference between those two adjusted means.
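If you instead wanted adjusted means at particular time points, a small sketch (assuming the at argument, which overrides the reference grid):
# Sketch: LS means for each group evaluated at each observed time value.
lsmeans(modelLMER, ~ group | timeNum, at = list(timeNum = 0:3))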
Regarding the discrepancy in your linear contrast of about 10.7: the linear contrast coefficients c(-3,-1,1,3) give you a multiple of the slope of the line. To get the slope, you need to divide by sum(c(-3,-1,1,3)^2) -- and also multiply by 2, because the contrast coefficients increment by 2.
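A quick arithmetic sketch of that rescaling, taking the roughly -10.7 raw contrast estimate mentioned above:
# Sketch: rescale the raw linear-contrast estimate to a slope difference.
coefs <- c(-3, -1, 1, 3)
scale_factor <- 2 / sum(coefs^2)   # 2 / 20 = 0.1
-10.7 * scale_factor               # about -1.07, matching the lmer interaction estimate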
Thanks to the invaluable help of @rvl I was able to solve this. Here is the code.
In order to generate the correct contrast matrices we first need to normalise them
(mainMat <- c(-1,-1,-1,-1,1,1,1,1)) # main effects matrix
(trendMat <- c(-3,-1,1,3,-3,-1,1,3)) # linear trend contrast coefficients
(nTimePoints <- 4) # number of timePoints
(mainNorm <- 1/nTimePoints)
(nGroups <- 2) # number of between-Ss groups
(trendIncrem <- 2) # the incremental increase of each new trend contrast coefficient
(trendNorm <- trendIncrem/(sum(trendMat^2))) # normalising the trend coefficients
Now we create several contrast matrices in the form of lists. These are normalised using the objects we created above
(groupMain = list(mainMat*mainNorm)) # normalised group main effect
(linTrend = list(trendMat*trendNorm)) # normalised linear trend
(linXGroup = list((mainMat*trendMat)*(nGroups*trendNorm))) # group x linear trend interaction
Now pass those lists of matrices into a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
And pass that master list into the contrasts function in lsmeans
contrast(lsMeansLM, contrasts)
This is the output
contrast estimate SE df t.ratio p.value
c(-0.25, -0.25, -0.25, -0.25, 0.25, 0.25, 0.25, 0.25) 1.927788 0.2230903 6 8.641 0.0001
c(-0.15, -0.05, 0.05, 0.15, -0.15, -0.05, 0.05, 0.15) 3.512465 0.1609290 6 21.826 <.0001
c(0.3, 0.1, -0.1, -0.3, -0.3, -0.1, 0.1, 0.3) -1.069407 0.3218581 6 -3.323 0.0160
How do we check if these are accurate estimates?
Note first that the estimate of the group*time interaction is now approximately the same value as is returned by
summary(modelLMER)
The 'main effect' trend (for want of a better descriptor), which is the rate of change in score across the four time points averaged across both levels of group, is 3.51. If we change the coding of the group factor to simple coding via
contrasts(df$group) <- c(-.5,.5)
and run summary(modelLMER) again, the time estimate will now be 3.51.
Finally, for the main effect of group, that is, the difference in score between groups averaged across all time points, we can run
pairs(lsmeans(modelLM,"group"))
And this will be -1.92. Thank you @rvl, a great answer. Using afex and lsmeans we have now forced a mixed ANCOVA, which treats the repeated-measures variable as categorical, to give us estimates of the group difference in trend and of the main effects that match those returned by a mixed-effects model in which the repeated-measures variable is continuous, with p- and F-values that match those of SPSS.

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
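As an aside (a small sketch, not part of the original answer), if you would rather not change the contrast option globally you can set sum-to-zero coding on just the cond factor:
# Sketch: sum-to-zero contrasts for the `cond` factor only.
df$cond <- factor(df$cond)
contrasts(df$cond) <- contr.sum(3)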
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
size=df$n,theta = df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond,data=df,
family=betabinomialff(zero=2)
## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from that mean (this doesn't give you the difference of level 3 from the mean directly, but by construction it should be -(cond1 + cond2) ...)
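A one-line sketch of that implied third deviation, using the coefficients printed above:
# Sketch: deviation of level C from the mean under sum-to-zero coding.
-(-0.3121925 + 0.3011559)   # = -(cond1 + cond2), about 0.011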
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
size=n, theta=exp(ltheta)),
parameters=list(lprob~cond),
data=df,
start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
Effects must be estimated relative to some base level. The effect of having any of the 3 conditions would be the same as a constant in the regression.
Since the intercept is the expected value when both estimated dummy levels (i.e. "B" and "C") are 0, it is the mean value of the reference group (i.e. "A") only.
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus the comparison group's coefficient. The comparison groups' coefficients, as you know, therefore give you the effect of having the comparison group = 1 (bearing in mind that each level of your categorical variable is a dummy variable which = 1 when that level is present) relative to the reference group.
So your results already give you the means and relative effects of each level. You can of course switch out the reference level according to your preference.
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.
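If you do decide you want a different baseline, a minimal sketch (the choice of "B" here is purely illustrative):
# Sketch: refit with "B" as the reference level of cond.
df$cond <- relevel(factor(df$cond), ref = "B")
fit_B <- glm.nb(k ~ cond + offset(log(n)), data = df)
summary(fit_B)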
