T-test in R over a large data frame

I'm attempting to run a t-test over a large data frame. The data frame contains CpG sites in the columns and the case/control groups in the rows.
Sample of the data:
Type cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289
1 Normal.01 0.32605 0.89785 0.73910 0.30960 0.80654 0.60874
2 Normal.05 0.28981 0.89931 0.72506 0.29963 0.81649 0.62527
3 Normal.11 0.25767 0.90689 0.77163 0.27489 0.83556 0.66264
4 Normal.15 0.26599 0.89893 0.75909 0.30317 0.81778 0.71451
5 Normal.18 0.29924 0.89284 0.75974 0.33740 0.83017 0.69799
6 Normal.20 0.27242 0.90849 0.76260 0.27898 0.84248 0.68689
7 Normal.21 0.22222 0.89940 0.72887 0.25004 0.80569 0.69102
8 Normal.22 0.28861 0.89895 0.80707 0.42462 0.86252 0.61141
9 Normal.24 0.43764 0.89720 0.82701 0.35888 0.78328 0.65301
10 Normal.57 0.26827 0.91092 0.73839 0.30372 0.81349 0.66338
There are 10 "normal" types and 62 "case" types (normal = rows 1-10, case = rows 11-72).
I attempted to run the following t-test on the 16384 CpG sites, but it only returned 72 p-values:
t.result <- apply(data[1:72,], 2, function (x) t.test(x[1:10],x[11:72],paired=FALSE))
data$p_value <- unlist(lapply(t.result, function(x) x$p.value))
data$fdr <- p.adjust(data$p_value, method = "fdr")
Any help would be much appreciated.

Probably you want something like this:
set.seed(1)
data <- matrix(runif(72*16384), nrow=72) # some random data as surrogate for your original data
indices <- expand.grid(1:10, 11:72) # generate all indices of pairs for t-test
t.result <- apply(indices, 1, function (x) t.test(data[x[1],],data[x[2],],paired=FALSE))
p_values <- unlist(lapply(t.result, function(x) x$p.value))
p_fdr <- p.adjust(p_values, method = "fdr")
hist(p_fdr, col='red', xlim=c(0,1), xlab='p-value', main='Histogram of p-values')
hist(p_values, add=TRUE, col=rgb(0, 1, 0, 0.5))
legend('topleft', legend=c('fdr-adjusted', 'unadjusted'), col=c('red', rgb(0, 1, 0, 0.5)), lwd=2)
As expected, FDR adjustment of the p-values eliminates almost all of the false positives.
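As an aside, if the goal is one test per CpG site (column), comparing the 10 normals to the 62 cases, a minimal column-wise sketch could look like this (cpg_df is a hypothetical stand-in for the original data frame, with the Type labels assumed to be in column 1):
betas <- cpg_df[, -1]  # drop Type so apply() doesn't coerce the frame to character
p_values <- apply(betas, 2, function(x) t.test(x[1:10], x[11:72])$p.value)
p_fdr <- p.adjust(p_values, method = "fdr")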

Related

R - Categorize a dataset

Morning folks,
I'm trying to categorize a set of numerical values (Days Left divided by 365.2, which gives approximately the number of years left until a maturity).
The results of this first calculation give me a vector of 3560 values (example: 0.81, 1.65, 3.26 [...], 0.2).
I'd like to categorise these results into intervals: between 0 and 1 year, 1 and 2 years, 2 and 3 years, 3 and 4 years, over 4 years.
# Set the data frame
dfMaturity <- data.frame(Maturity = DATA$Maturity)
# Call the library and run the function
library(plyr)
MaturityX <- ddply(dfMaturity, .(Maturity), nrow)
# Set the data frame
dfMaturityID <- data.frame(testttto = DATA$Security.Name)
MaturityID <- ddply(dfMaturityID, .(testttto), nrow)
# Calculation of the remaining days
survey <- data.frame(date = c(DATA$Maturity), tx_start = c("1/1/2022"))
survey$date_diff <- as.Date(as.character(survey$date), format = "%m/%d/%Y") -
  as.Date(as.character(survey$tx_start), format = "%m/%d/%Y")
# Data for the table
MaturityName <- MaturityID$testttto
MaturityZ <- survey$date
TimeToMaturity <- as.numeric(survey$date_diff)
# /!\ HERE IS WHERE I NEED HELP /!\ I'M TRYING TO CATEGORISE THE RESULTS OF THIS CALCULATION
Multiplier <- TimeToMaturity / 365.2
cx <- cut(Multiplier, breaks = 0:5)
The original data source comes from an Excel file (DATA$Maturity).
In case it helps, print(Multiplier) gives us
[1] 0.4956188 1.4950712 1.9989047 0.2464403 0.9994524 3.0010953 5.0000000 7.0016429 9.0005476
[10] 21.0021906 4.1621030 13.1626506 1.1610077 8.6664841 28.5377875 3.1626506 6.7497262 2.0920044
[19] 2.5602410 4.6495071 0.3368018 6.3225630 8.7130340 10.4956188 3.9019715 12.7957284 5.8378970
I copied only the first three lines; there are 3,560 values in total.
I'm open to any kind of help, I just want it to work :) Thank you!
The cut function does that:
example <- c(0.81, 1.65, 3.26, 0.2)
cut(example, breaks = c(0, 1, 2, 3, 4),
    labels = c("newborn", "one year old", "two", "three"))
Edit:
From the comment
I'd then like to create a table with, for example: 30% of the objects have a maturity between 0 and 1 year
You could compute that using the function below:
example <- c(0.81, 1.65, 3.26, 0.2)
share <- function(x, lower = 0, higher = 1){
  x <- na.omit(x)
  sum((lower <= x) & (x < higher)) / length(x)
}
share(1:10, lower = 0, higher = 3.5)   # true for 1:3 out of 1:10, so 30%
share(1:10, lower = 4.5, higher = 5.5) # true for 5 only, so 10%
share(example, 0, 3)
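If you then want the share for every interval at once, one option (a sketch; the breaks and labels here are assumptions) is to tabulate the cut factor:
# Share of observations per maturity bucket
Multiplier <- c(0.81, 1.65, 3.26, 0.2)  # stand-in for the real vector
cx <- cut(Multiplier, breaks = c(0, 1, 2, 3, 4, Inf),
          labels = c("0-1y", "1-2y", "2-3y", "3-4y", "over 4y"))
prop.table(table(cx))  # here "0-1y" gets 0.5 (2 of the 4 values)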

Extracting certain levels more than others

I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
                 species = letters[1:6])
print(df)
rarity species
1 common a
2 common b
3 uncommon c
4 uncommon d
5 rare e
6 rare f
I then create another data set based on the random sampling of rows from df.
df.sampled <- df[sample(1:nrow(df), 30, T),]
The trouble is that this isn't realistic; you're not going to encounter rare species as frequently as uncommon species, nor uncommon species as frequently as common ones. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for(i in 1:1000){
  df.sampled <- df[sample(1:6, 30, T),]
  df.matrix[i,] <- c(table(df.sampled$rarity))
}
apply(df.matrix, 2, mean)
Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
Here is your line edited to use the prob argument with example values of 0.6 for common, 0.3 for uncommon and 0.1 for rare:
prob_vec <- c(0.6, 0.6, 0.3, 0.3, 0.1, 0.1)
df.sampled <- df[sample(1:nrow(df), 30, T, prob = prob_vec),]
df.sampled now has a more uneven distribution.
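To check the effect, you could rerun the frequency loop from the question with the prob argument (a quick sketch; note that table() orders the rarity levels alphabetically, i.e. common, rare, uncommon, so the counts are wrapped in factor() with fixed levels to keep the columns stable):
df.matrix <- matrix(NA, ncol = 3, nrow = 1000)
for (i in 1:1000) {
  df.sampled <- df[sample(1:nrow(df), 30, TRUE, prob = prob_vec), ]
  # fixed levels guard against draws where a rarity happens to be absent
  df.matrix[i, ] <- table(factor(df.sampled$rarity,
                                 levels = c("common", "rare", "uncommon")))
}
apply(df.matrix, 2, mean)  # should approach 18 common, 3 rare, 9 uncommon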

repeated measures bootstrap stats, grouped by multiple factors

I have a data frame that looks like this, but obviously with many more rows etc:
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                 cond=c('A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'),
                 comm=c('X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'),
                 measure=c(0.8, 1.1, 0.7, 1.2, 0.9, 2.3, 0.6, 1.1, 0.7, 1.3, 0.6, 1.5, 1.0, 2.1, 0.7, 1.2))
So we have 2 factors (each with 2 levels, thus 4 combinations) and one continuous measure. We also have a repeated measures design, in that we have multiple measures within each cell that correspond to the same id.
I've attempted to first solve the groupby issue, then the bootstrap issue, then combine the two, but am pretty much stuck...
Stats, grouped by the 2 factors
I can get multiple summary stats for each of the 4 cells by:
summary_stats <- aggregate(df$measure,
                           by = list(df$cond, df$comm),
                           function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(summary_stats)
resulting in
Group.1 Group.2 x.mean x.median x.sd
1 A X 0.85000000 0.85000000 0.12909944
2 B X 0.65000000 0.65000000 0.05773503
3 A Y 1.70000000 1.70000000 0.58878406
4 B Y 1.25000000 1.20000000 0.17320508
This is great as we are getting multiple stats for each of the 4 cells.
But what I'd really like is the 95% bootstrap CIs, for each stat, for each of the 4 cells. I don't mind if I have to run a final solution once per statistic (e.g. mean, median, etc.), but bonus points for doing it all in one go.
Bootstrap for repeated measures
Can't quite make this work, but what I want is 95% bootstrap CIs, computed in a way that is appropriate for this repeated measures design. Unless I'm mistaken, I want to select bootstrap samples on the basis of id (not on the basis of rows of the dataframe), then calculate a summary measure (e.g. mean) for each of the 4 cells.
library(boot)
myfunc <- function(data, indices) {
  # select bootstrap sample to index into `id`
  d <- data[data$id==indicies,]
  return(c(mean=mean(d), median=median(d), sd = sd(d)))
}
bresults <- boot(data = CO2$uptake, statistic = myfunc, R = 1000)
Q1: I'm getting errors in selecting the bootstrap sample by id, i.e. the line d <- data[ data$id==indicies, ]
Combining bootstrap and the groupby 2 factors
Q2: I have no intuition of how to gel the two approaches together to achieve the final desired result. My only idea is to put the aggregate call in myfunc, to repeatedly calculate cell stats under each bootstrap replicate, but I'm out of my comfort zone with R here.
With your two questions, you have two issues:
How to bootstrap (resample) your data in such a way that you resample based on id, rather than rows
How to perform separate bootstraps for the four groups in your 2x2 design
One easy way to do this would be by using the following packages (all part of the tidyverse):
dplyr for manipulating your data (in particular, summarising the data you have for each id), and also for the neat %>% forward pipe operator, which supplies the result of an expression as the first argument to the next expression so you can chain commands
broom for doing an operation for each group in your dataframe
boot (which you already use) for the bootstrapping
Load the packages:
library(dplyr)
library(broom)
library(boot)
First of all, to make sure when we resample we include a subject or not, I would save the various values each subject has as a list:
df <- df %>%
  group_by(id, cond, comm) %>%
  summarise(measure = list(measure)) %>%
  ungroup()
Now the dataframe has fewer rows (4 per ID), and the variable measure is not numeric anymore (instead, it's a list). This means we can just use the indices that boot provides (solving issue 1), but also that we'll have to "unlist" it when we actually want to do calculations with it, so your function now becomes:
myfunc <- function(data, indices) {
  data <- data[indices,]
  return(c(mean = mean(unlist(data$measure)),
           median = median(unlist(data$measure)),
           sd = sd(unlist(data$measure))))
}
Now that we can simply use boot to resample each row, we can think about how to do it neatly per group. This is where the broom package comes in: combined with dplyr's do(), it lets you run an operation for each group in your data frame and store the results in a tidy dataframe, with one row for each of your groups and a column for each value your function produces. So we simply group the dataframe again, and then call do(tidy(...)), with a . instead of the name of our variable. This hopefully solves issue 2 for you!
bootresults <- df %>%
  group_by(cond, comm) %>%
  do(tidy(boot(data = ., statistic = myfunc, R = 1000)))
This produces:
# Groups: cond, comm [4]
cond comm term statistic bias std.error
<fctr> <fctr> <chr> <dbl> <dbl> <dbl>
1 A X mean 0.85000000 0.000000000 5.280581e-17
2 A X median 0.85000000 0.000000000 5.652979e-17
3 A X sd 0.12909944 -0.004704999 4.042676e-02
4 A Y mean 1.70000000 0.000000000 1.067735e-16
5 A Y median 1.70000000 0.000000000 1.072347e-16
6 A Y sd 0.58878406 -0.005074338 7.888294e-02
7 B X mean 0.65000000 0.000000000 0.000000e+00
8 B X median 0.65000000 0.000000000 0.000000e+00
9 B X sd 0.05773503 0.000000000 0.000000e+00
10 B Y mean 1.25000000 0.001000000 7.283065e-02
11 B Y median 1.20000000 0.027500000 7.729634e-02
12 B Y sd 0.17320508 -0.030022214 5.067446e-02
Hopefully this is what you'd like to see!
If you want to then use the values from this dataframe a bit more, you can use other dplyr functions to select which rows in this table you look at. For example, to look at the bootstrapped standard error of the standard deviation of your measure for condition A / X, you can do the following:
bootresults %>% filter(cond=='A', comm=='X', term=='sd') %>% pull(std.error)
I hope that helps!
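If you want actual 95% percentile intervals rather than the bootstrap standard errors that tidy() reports, one option (a sketch, using the list-column df and myfunc from above) is to call boot.ci() on the boot object for a single cell:
cell <- df %>% filter(cond == 'B', comm == 'Y')
b <- boot(data = cell, statistic = myfunc, R = 1000)
boot.ci(b, type = "perc", index = 1)  # index 1 = the mean; with only two ids the interval is crude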
For a bootstrap with a cluster variable, here's a solution without additional packages. I didn't use the boot package though.
Part 1: Bootstrap
This function draws a random sample from a set of clustered observations.
.clusterSample <- function(x, id){
  boot.id <- sample(unique(id), replace=T)
  out <- lapply(boot.id, function(i) x[id %in% i,])
  return( do.call("rbind", out) )
}
Part 2: Bootstrap estimates and CIs
The next function draws multiple samples and applies the same aggregate statement to each of them. The bootstrap estimates and CIs are then obtained by mean and quantile.
clusterBoot <- function(data, formula, cluster, R=1000, alpha=.05, FUN){
  # cluster variable
  cls <- model.matrix(cluster, data)[,2]
  template <- aggregate(formula, .clusterSample(data, cls), FUN)
  var <- which( names(template)==all.vars(formula)[1] )
  grp <- template[,-var,drop=F]
  val <- template[,var]
  x <- vapply( 1:R, FUN=function(r) aggregate(formula, .clusterSample(data,cls), FUN)[,var],
               FUN.VALUE=val )
  if(is.vector(x)) dim(x) <- c(1,1,length(x))
  if(is.matrix(x)) dim(x) <- c(nrow(x),1,ncol(x))
  # bootstrap estimates
  est <- apply( x, 1:2, mean )
  lo  <- apply( x, 1:2, function(i) quantile(i,alpha/2) )
  up  <- apply( x, 1:2, function(i) quantile(i,1-alpha/2) )
  colnames(lo) <- paste0(colnames(lo), ".lo")
  colnames(up) <- paste0(colnames(up), ".up")
  return( cbind(grp,est,lo,up) )
}
Note the use of vapply. I use it because I prefer working with arrays over lists. Note also that I used the formula interface to aggregate, which I also like better.
Part 3: Examples
It can be used with basically any kind of statistic, even without grouping variables. Some examples:
myStats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=myStats)
# cond comm mean median sd mean.lo median.lo sd.lo mean.up median.up sd.up
# 1 A X 0.85 0.850 0.11651125 0.85 0.85 0.05773503 0.85 0.85 0.17320508
# 2 B X 0.65 0.650 0.05773503 0.65 0.65 0.05773503 0.65 0.65 0.05773503
# 3 A Y 1.70 1.700 0.59461417 1.70 1.70 0.46188022 1.70 1.70 0.69282032
# 4 B Y 1.24 1.215 0.13856406 1.15 1.15 0.05773503 1.35 1.35 0.17320508
clusterBoot(data=df, formula=measure~cond+comm, cluster=~id, R=10, FUN=mean)
# cond comm est .lo .up
# 1 A X 0.85 0.85 0.85
# 2 B X 0.65 0.65 0.65
# 3 A Y 1.70 1.70 1.70
# 4 B Y 1.25 1.15 1.35
clusterBoot(data=df, formula=measure~1, cluster=~id, R=10, FUN=mean)
# est .lo .up
# 1 1.1125 1.0875 1.1375

Get specific elements from clustered data in R

I generated this image using the hclust function. Now I want the IDs of those elements highlighted by squares.
Is there any way to get the ID and related value from the clustered datasets? Thanks
EDIT
I used this R script
library(gplots)
library(geneplotter)
# read the data in from URL
bots <- read.table("expression.txt")
# get just the alpha data
abot <- bots[,c(1:9)]
rownames(abot) <- bots[,1]
abot[1:7,]
# get rid of NAs
abot[is.na(abot)] <- 0
# we need to find a way of reducing the data. Can't do ANOVA as there are no
# replicates. Sort on max difference and take first 1000
min <- apply(abot, 1, min)
max <- apply(abot, 1, max)
sabot <- abot[order(max - min, decreasing=TRUE),][1:1000,]
# cluster on correlation
cdist <- as.dist(1 - cor(t(sabot)))
hc <- hclust(cdist, "average")
# draw a heatmap
x11()
heatmap.2(as.matrix(sabot),
          Rowv=as.dendrogram(hc),
          Colv=FALSE,
          cexRow=1,
          cexCol=1,
          dendrogram="row",
          scale="row",
          trace="none",
          density.info="none",
          key=FALSE,
          col=greenred.colors(80))
and my data look like this
YF MF SF YL ML SL Stem Root SULE
1 31.64075611 32.2728151 38.81790359 252.8901009 269.7599455 138.5011042 16.58308894 10.47935935 3.364295997
2 6.484902171 9.141084197 5.748798541 3.637332586 4.762966989 4.149302282 7.194971046 9.932508868 1.600027931
3 14.15218386 8.784155316 9.740794214 6.566584262 6.130503033 7.747728536 12.57014531 15.75181203 9.22907038
4 15.72881736 19.95755802 10.13050089 10.31313758 9.838844457 14.24864327 13.00442008 23.85404067 12.17251862
5 30.45475953 15.57131432 17.15277867 8.884751572 8.78786964 12.4745649 11.90176123 35.9844343 6.904763942
6 15.87149807 19.05523246 13.12846166 12.99750491 15.3775883 19.0044086 21.66051467 20.38501538 39.58478032
7 16.58935728 18.63990933 17.20955634 13.04423927 29.98424087 18.02165996 22.22403582 32.38377369 10.90832984
8 29.91118855 19.65844846 23.45958109 62.56338088 55.3926187 39.85296152 31.4832543 14.8484163 1.326553777
9 4.09192129 15.52499475 12.14321788 1.680854758 3.448485979 5.245481483 15.14443161 28.85873063 1.073855381
10 7.02768911 4.267210165 3.383501945 3.53716686 3.105614581 3.493791292 3.806360251 6.713067543 3.338740245
11 17.61821596 18.03607855 12.939663 8.951935241 15.45268577 15.53817186 20.5098186 23.42760284 27.97680418
12 66.35291651 40.41837702 37.7239447 32.42998176 30.09696289 27.81089554 33.27197681 46.5393928 4.141505618
13 15.45804403 15.98469202 17.21176468 9.105208867 11.76140929 13.9751105 14.72159466 25.68388472 7.493988128
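One way to recover the highlighted elements (a sketch, assuming the hc object from the script above) is to cut the dendrogram into groups with cutree() and pull out the rownames of the group of interest:
groups <- cutree(hc, k = 5)        # cluster label per row of sabot; k = 5 is a guess to match the squares
ids <- names(groups)[groups == 2]  # IDs in a hypothetical cluster 2
sabot[ids, ]                       # the related expression values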

Computing a t-test with the help of the apply function

I have a matrix:
> data
A A A B B C
gene1 1 6 11 16 21 26
gene2 2 7 12 17 22 27
gene3 3 8 13 18 23 28
gene4 4 9 14 19 24 29
gene5 5 10 15 20 25 30
I want to test, for each gene (row), whether the mean values differ between the different groups. I want to use a t-test for this. The function should take all columns belonging to group A, all columns belonging to group B, all columns belonging to group C, and so on, and calculate the t-test between each pair of groups for each gene (every group contains several columns).
One implementation, which I got from an answer to my previous post, is:
Results <- combn(colnames(data), 2, function(x) t.test(data[,x]), simplify = FALSE)
sapply(Results, "[", c("statistic", "p.value"))
but it computes the test between all columns rather than between groups for every row. Can somebody help me modify this code to calculate t-tests between groups for my data?
Maybe this can be useful:
> Mat <- matrix(1:20, nrow=4, dimnames=list(NULL, letters[1:5]))
> # t.test
> Results <- combn(colnames(Mat), 2, function(x) t.test(Mat[,x]), simplify = FALSE)
> Pairs <- combn(colnames(Mat), 2) # the column pairs, used to name the results
> names(Results) <- apply(Pairs, 2, paste0, collapse="~")
> Results # Only the first element of the `Results` is shown
$`a~b` # t.test applied to a and b
One Sample t-test
data: Mat[, x]
t = 5.1962, df = 7, p-value = 0.001258
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.452175 6.547825
sample estimates:
mean of x
4.5
...
A nicer output:
> sapply(Results, "[", c("statistic", "p.value"))
a~b a~c a~d a~e b~c b~d b~e c~d
statistic 5.196152 4.140643 3.684723 3.439126 9.814955 6.688732 5.41871 14.43376
p.value 0.00125832 0.004345666 0.007810816 0.01085005 2.41943e-05 0.0002803283 0.0009884764 1.825796e-06
c~e d~e
statistic 9.23682 19.05256
p.value 3.601564e-05 2.730801e-07
Almost there. With apply, you don't give the arguments inside the function; you pass them as extra arguments to apply:
data <- matrix(1:20, 4, 5)
Tscore <- apply(data, 2, t.test, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
and to test whether this is what you wanted, check the t statistics:
t.test(data[,1], alternative = c("two.sided", "less", "greater"),mu = 0, paired = FALSE, var.equal = FALSE,conf.level = 0.95)
I may have misunderstood the question, though; this just implements your y = NULL case, i.e. a one-sample t-test of each single column.
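Since neither answer tests per gene between the column groups, here is a minimal sketch of that (an assumption about the intended design, not code from the thread); a t-test needs at least two observations per group, so group C, which has only one column in the example, is skipped:
data <- matrix(1:30, nrow = 5,
               dimnames = list(paste0("gene", 1:5),
                               c("A", "A", "A", "B", "B", "C")))
groups <- colnames(data)
p_AB <- apply(data, 1, function(row)
  t.test(row[groups == "A"], row[groups == "B"])$p.value)
p_AB  # named vector of per-gene p-values for A vs. B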
