I need to write my own test in R, based on the mean of a test statistic of two given random variables X and Y whose distributions are unknown.
I am given the following code:
mean.test <- function(x, y, B = 10000,
                      alternative = c("two.sided", "less", "greater"))
{
  p.value <- 0
  alternative <- match.arg(alternative)
  # random samples of the test statistic
  s <- replicate(B, mean(sample(c(x, y), B, replace = TRUE)) -
                    mean(sample(c(x, y), B, replace = TRUE)))
  t <- mean(x) - mean(y)  # test statistic t
  p.value <- 2 * (1 - pnorm(mean(s)))  # attempt to calculate the p-value
  data.name <- deparse(substitute(c(x, y)))
  names(t) <- "difference in means"
  zero <- 0
  names(zero) <- "difference in means"
  return(structure(list(statistic = t, p.value = p.value,
                        method = "mean test", data.name = data.name,
                        observed = c(x, y), alternative = alternative,
                        null.value = zero),
                   class = "htest"))
}
Here t is the difference between the means of the two samples. I am given the expected output for some function calls, but I never manage to reproduce it.
For example, the following:
set.seed(0)
mean.test(rnorm(100,50,4),rnorm(100,51,5),alternative="less")
Should output:
mean test
data: c(rnorm(100, 50, 4), rnorm(100, 51, 5))
difference in means = -2.0224, p-value = 0.0011
alternative hypothesis: true difference in means is less than 0
But it outputs:
mean test
data: c(rnorm(100, 50, 4), rnorm(100, 51, 5))
difference in means = -0.68157, p-value = 1
alternative hypothesis: true difference in means is less than 0
I am sure that I am calculating the p-value in the wrong way. The difference in means is also wrong for this example, although it is right for other examples in the exercise. I am really confused about how to calculate the p-value. How do I calculate it?
I am trying to do Fisher's exact test for combinations of an n x 2 dataframe, and from what I have read, pairwise Fisher's seems to be what I want to use (see here). However, it produced p-value results that didn't look right, so I decided to manually check some combinations, and I got different results. I've included what I hope is a reproducible example to highlight what I've tried. Perhaps I'm doing something wrong with the R code, as I'm still relatively inexperienced, or I may be completely misunderstanding what the pairwise tests are meant to compute.
# Packages -----------------------------------------------------------
library("tidyverse")
library("janitor")
library("RVAideMemoire")
library("fmsb")
# Generate Data -----------------------------------------------------------
set.seed(1)
test <- tibble(
  "drug" = sample(
    c("Control", "Treatment1", "Treatment2"),
    size = 300,
    prob = c(0.1, 0.4, 0.3),
    replace = TRUE
  ),
  "country" = sample(
    c("Canada", "United States"),
    size = 300,
    prob = c(0.4, 0.6),
    replace = TRUE
  ),
  "selected" = sample(
    c(0, 1),
    size = 300,
    prob = c(0.1, 0.65),
    replace = TRUE
  )
)
test2 <- test %>%
  filter(selected == 1)
test2_tab <- test2 %>%
  tabyl(drug, country) %>%
  remove_rownames() %>%
  column_to_rownames(var = colnames(.[1])) %>%
  as.matrix()
When I run the following pairwise tests, I get the output below (I used two packages just to make sure the problem wasn't my implementation of one of them).
# Pairwise ----------------------------------------------------------------
RVAideMemoire::fisher.multcomp(test2_tab, p.method = "bonferroni")
fmsb::pairwise.fisher.test(test2_tab, p.adjust.method = "bonferroni")
Pairwise comparisons using Fisher's exact test for count data
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
Pairwise comparisons using Pairwise comparison of proportions (Fisher)
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
However, when I create the individual tables and perform individual Fisher's tests, like below, I get different results.
# Individual --------------------------------------------------------------
drug.groups2 <- unique(test2$drug)
# Just to check the correct 2x2 tables are produced
# combn(drug.groups2, 2, function(x) {
#   id <- test2$drug %in% x
#   cross_tabs <- table(test2$drug[id], test2$country[id])
# }, simplify = FALSE)
combn(drug.groups2, 2, function(x) {
  id <- test2$drug %in% x
  cross_tabs <- table(test2$drug[id], test2$country[id])
  fishers <- fisher.test(cross_tabs)
  fishers$data.name <- paste(unique(as.character(test2$drug[id])), collapse = "-")
  return(fishers)
}, simplify = FALSE)
[[1]]
Fisher's Exact Test for Count Data
data: Treatment1-Treatment2
p-value = 0.3357
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7566901 2.4175206
sample estimates:
odds ratio
1.347105
[[2]]
Fisher's Exact Test for Count Data
data: Treatment1-Control
p-value = 0.4109
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2560196 1.6292583
sample estimates:
odds ratio
0.6637235
[[3]]
Fisher's Exact Test for Count Data
data: Treatment2-Control
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3294278 2.3146386
sample estimates:
odds ratio
0.8940101
Isn't it due to the Bonferroni correction, which is applied to the pairwise comparisons but not to the individual tests?
As clearly pointed out in the comments by Lukasz and StupidWolf, I had forgotten that I had applied the p.method = "bonferroni" correction; the results match the individual tests when the function is called with p.method = "none".
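For completeness, a quick check (a sketch reusing the test2_tab object from above): rerunning the pairwise test with the adjustment switched off should reproduce the unadjusted p-values from the individual fisher.test() calls.
# Pairwise test without multiplicity adjustment; these p-values should
# match the individual fisher.test() results above
RVAideMemoire::fisher.multcomp(test2_tab, p.method = "none")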
I am attempting to follow the example from LSD.test on my data set. Unfortunately, my data set has unequal sample sizes. I have read that this can be handled by a weighted mean; does anyone have experience with this? Is there a way to calculate it outside of the function?
Here is my example code:
library(agricolae)
data <- as.data.frame(c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 11)))
data$Value <- c(59.15,48.90,29.65,32.60,63.85,53.85,66.40,55.05,54.75,39.95,63.20,57.40,59.15,54.10,49.40,78.70,66.20,90.75,
81.20,52.25,53.70,51.10,48.60,50.15,63.40,56.15,38.40,66.45,53.35,45.30,46.60,53.20,53.95,44.55,49.15,42.65,
68.25,67.60,57.90,47.85,52.90)
colnames(data) <- c("Treatment", "Value")
cal <- lm(Value ~ Treatment, data = data)
model <- aov(cal)
out <- LSD.test(model, "Treatment", p.adj = "bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df <- df.residual(model)
MSerror <- deviance(model) / df
out <- with(data, LSD.test(Value, Treatment, df, MSerror))
#stargraph
# Variation interquartile range: Q75 and Q25
plot(out, variation = "IQR")
#endgraph
out <- LSD.test(model, "Treatment", p.adj = "hommel", console = TRUE)
plot(out, variation = "SD")  # variation: standard deviation
This still "works", but it does not list the "Minimum significant difference" as in the typical example:
library(agricolae)
data(sweetpotato)
model <- aov(yield ~ virus, data = sweetpotato)
out <- LSD.test(model, "virus", p.adj = "bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df <- df.residual(model)
MSerror <- deviance(model) / df
out <- with(sweetpotato, LSD.test(yield, virus, df, MSerror))
#stargraph
# Variation interquartile range: Q75 and Q25
plot(out, variation = "IQR")
#endgraph
out <- LSD.test(model, "virus", p.adj = "hommel", console = TRUE)
plot(out, variation = "SD")  # variation: standard deviation
Edit 1: I have also tried using unequal groups with HSD.test in agricolae; however, this returns an error:
data(sweetpotato)
A<-sweetpotato[-c(4,5,7),]
modelUnbalanced <- aov(yield ~ virus, data=A)
outUn <-HSD.test(modelUnbalanced, "virus",group=FALSE, unbalanced = TRUE)
Error in HSD.test(modelUnbalanced, "virus", group = FALSE, unbalanced = TRUE) :
unused argument (unbalanced = TRUE)
Edit 2: I have now used the group=FALSE argument to obtain the individual pairwise comparisons:
difference pvalue signif. LCL UCL
A - B -14.820000 0.0301 * -28.664094 -0.9759055
A - C -2.245000 1.0000 -16.089094 11.5990945
A - D -3.557727 1.0000 -17.083524 9.9680696
B - C 12.575000 0.0943 . -1.269094 26.4190945
B - D 11.262273 0.1554 -2.263524 24.7880696
C - D -1.312727 1.0000 -14.838524 12.2130696
Now how would I find the difference at which p = 0.05? I have looked at a linear relationship between the three p-values that aren't equal to 1 and the square root of the difference. The relationship isn't perfect, but it is strong (R² = 0.98), likely an effect of the sample sizes.
Could I use this to predict the least significant difference at p = 0.05? Or am I completely misguided?
Any help is greatly appreciated!
Cheers,
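One way to calculate this outside the function, sketched under the usual LSD assumptions (reusing the model and data objects from above; alpha and n are names introduced here): for unequal group sizes, the least significant difference for a given pair of groups is the t quantile times the standard error of that pair's mean difference.
# Sketch: LSD for a specific pair of unequally sized groups,
# computed directly from the fitted ANOVA
df.err  <- df.residual(model)
MSerror <- deviance(model) / df.err
n       <- table(data$Treatment)   # per-group sample sizes
alpha   <- 0.05
# t quantile times the standard error of the A-B mean difference
lsd_AB <- qt(1 - alpha/2, df.err) * sqrt(MSerror * (1/n[["A"]] + 1/n[["B"]]))
lsd_AB
Note that this is a per-pair quantity: with unequal n there is no single LSD value, which may be why LSD.test stops reporting one.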
I wrote this code to run a test, based on a test statistic, for two samples x and y drawn from unknown distributions:
mean.test <- function(x, y, B = 10000,
                      alternative = c("two.sided", "less", "greater"))
{
  p.value <- 0
  alternative <- match.arg(alternative)
  s <- replicate(B, mean(sample(c(x, y), B, replace = TRUE)) -
                    mean(sample(c(x, y), B, replace = TRUE)))
  t <- mean(x) - mean(y)
  p.value <- 2 * (1 - pnorm(abs(quantile(T, 0.01)), mean = 0, sd = 1,
                            lower.tail = TRUE, log.p = FALSE))  # try to calculate the p-value
  data.name <- deparse(substitute(c(x, y)))
  names(t) <- "difference in means"
  zero <- 0
  names(zero) <- "difference in means"
  return(structure(list(statistic = t, p.value = p.value,
                        method = "mean test", data.name = data.name,
                        observed = c(x, y), alternative = alternative,
                        null.value = zero),
                   class = "htest"))
}
The code uses Monte Carlo simulation to generate the distribution of the test statistic mean(x) - mean(y) and then calculates the p-value, but apparently I defined this p-value incorrectly, because for:
> set.seed(0)
> mean.test(rnorm(1000,3,2),rnorm(2000,4,3))
The output should look like:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
But I got this instead:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value = 0.8087
alternative hypothesis: true difference in means is not equal to 0
Can someone explain the bug to me?
As far as I can tell, your code has numerous mistakes and errors in it:
quantile(T, 0.01) - here T == TRUE, so you are calculating a quantile of the value 1, not of your simulated distribution.
The object s is never used.
mean(sample(c(x,y), B, replace=TRUE)) - what are you trying to do here? The c() function combines x and y, and sampling B values from the pooled vector makes no sense, since you don't know what population they come from.
When you calculate the test statistic t, it should depend on the variance (and the sample sizes).
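For what it's worth, here is a minimal sketch of one common fix, a permutation test: shuffle the pooled sample while keeping the two group sizes, so the null distribution of mean(x) - mean(y) has the correct spread. This is one standard approach, not necessarily the intended course solution (mean.test2 is a name introduced here):
mean.test2 <- function(x, y, B = 10000,
                       alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  t.obs  <- mean(x) - mean(y)   # observed test statistic
  pooled <- c(x, y)
  nx     <- length(x)
  # permutation distribution under H0: group labels are exchangeable
  s <- replicate(B, {
    perm <- sample(pooled)
    mean(perm[1:nx]) - mean(perm[-(1:nx)])
  })
  p.value <- switch(alternative,
                    two.sided = mean(abs(s) >= abs(t.obs)),
                    less      = mean(s <= t.obs),
                    greater   = mean(s >= t.obs))
  c(statistic = t.obs, p.value = p.value)
}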
I have a problem with the ks.test function in R. I have a Laplace distribution:
ldes <- function(y, a) {
  # inverse-CDF sampling; ifelse() keeps the function vectorized,
  # since if() would only test the first element of y
  ifelse(y < 0.5, 1/a * log(2 * y, 2), 1/a * log(2 * (1 - y), 2))
}
a <- 1
set.seed(1)
y = runif(1000, 0, 1)
ld <- ldes(y, a)
So I need to do the KS test, but I can't find anything about the second parameter that should go there, like:
ks.test(my_lnorm, plnorm, mean = -5, sd = 5)
for a lognormal distribution, or:
ks.test(my_log, plogis, location = 2, scale = 3)
for a logistic distribution.
Thanks.
You can try a package for the Laplace distribution, for example disclap (if it satisfies your need; otherwise, use some continuous analog).
library(disclap)
ks.test(ld, "pdisclap", 0.5) # choose the right value of parameter p (p=0.5 is arbitrary)
One-sample Kolmogorov-Smirnov test
data: ld
D = 0.3333, p-value < 2.2e-16
alternative hypothesis: two-sided
As can be seen from the result of the hypothesis test, the null hypothesis (that the samples are drawn from the same population distribution) is rejected.
y2 <- rdisclap(1000, p=0.5) # generate some simulated datapoints
plot(ecdf(ld), xlim = range(c(ld, y2))) # compare ecdfs
plot(ecdf(y2), add = TRUE, lty = "dashed")
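If a continuous Laplace CDF is needed, note that ks.test also accepts a plain R function as its second argument, so the CDF can be written by hand (a sketch; plaplace and the parameters m and b are names introduced here, and the values passed below are arbitrary):
# Hand-written CDF of the continuous Laplace distribution with
# location m and scale b
plaplace <- function(q, m = 0, b = 1) {
  ifelse(q < m, 0.5 * exp((q - m) / b), 1 - 0.5 * exp(-(q - m) / b))
}
ks.test(ld, plaplace, 0, 1)  # extra arguments are passed on to the CDF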
I have a dataset that I need to transform into a normal distribution.
First, generate a reproducible dataset.
df <- runif(500, 0, 100)
Second, define a function. This function will keep transforming the data until the Shapiro-Wilk P-value exceeds 0.10. The transformed data will be assigned to y.
library(car)  # needed for powerTransform()

BoxCoxTrans <- function(y)
{
  lambda <- 1
  constant <- 0
  while (shapiro.test(y)$p.value < 0.10)
  {
    # shift so all values are positive, then apply the estimated power
    constant <- abs(min(y, na.rm = TRUE)) + 0.001
    y <- y + constant
    lambda <- powerTransform(y)$lambda
    y <- y ^ lambda
  }
  assign("y", y, envir = .GlobalEnv)
}
Third, test df
shapiro.test(df)
Shapiro-Wilk normality test
data: df
W = 0.95997, p-value = 2.05e-10
Because P < 0.05, transform df
BoxCoxTrans(df)
Then it gives me the following error messages,
Error in qr.resid(xqr, w * fam(Y, lambda, j = TRUE)) :
NA/NaN/Inf in foreign function call (arg 5)
What did I do wrong?
You could use a Box-Muller Transformation to generate an approximately normal distribution from a random uniform distribution. This might be more appropriate than a Box-Cox Transformation, which AFAIK is typically applied to convert a skewed distribution into one that is almost normal.
Here's an example of a Box-Muller Transformation applied to a set of uniformly distributed numbers:
set.seed(1234)
size <- 5000
a <- runif(size)
b <- runif(size)
y <- sqrt(-2 * log(a)) * cos(2 * pi * b)
plot(density(y), main = "Example of Box-Muller Transformation", xlab="x", ylab="f(x)")
library(nortest)
#> lillie.test(y)
#
# Lilliefors (Kolmogorov-Smirnov) normality test
#
#data: y
#D = 0.009062, p-value = 0.4099
#
#> shapiro.test(y)
#
# Shapiro-Wilk normality test
#
#data: y
#W = 0.99943, p-value = 0.1301
#
Hope this helps.
Add
print(summary(y))
before the end of your while loop and watch your computation explode. In any event, repeatedly applying Box-Cox makes no sense, because you get the ML(-like) estimator of the transformation parameter from the first application. Moreover, why would you expect a power transformation to normalize a uniform distribution?
John
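To make that point concrete, here is a sketch of applying Box-Cox once rather than in a loop (this assumes the car package for powerTransform(); even the single ML-like estimate should not turn uniform data into normal data):
library(car)  # for powerTransform()

set.seed(1)
df <- runif(500, 0, 100)
lambda <- powerTransform(df)$lambda        # estimated once, as described above
y <- if (abs(lambda) < 1e-6) log(df) else df ^ lambda  # lambda near 0 means log
shapiro.test(y)  # likely still rejects: a power transform can't normalize a uniform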