T-Test For Genes using Apply Function in Dataframe - r

I’m trying to run a t.test on two data frames.
The data frames (which I carved out of a larger data.frame) have the data I need in rows 1:143. I’ve already created sub-variables because I needed to calculate rowMeans.
> c.mRNA<-rowMeans(c007[1:143,(4:9)])
> h.mRNA<-rowMeans(c007[1:143,(10:15)])
I’m simply trying to run a t.test for each row, and then plot the p-values as histograms. This is what I thought would work…
Pvals<-apply(mRNA143.data,1,function(x) {t.test(x[c.mRNA],x[h.mRNA])$p.value})
But I keep getting an error?
Error in t.test.default(x[c.mRNA], x[h.mRNA]) :
not enough 'x' observations
I’ve got something off in my syntax and cannot figure it out for the life of me!
EDIT: I've created a data.frame so it's now just two columns, I need a p-value for each row. Below is a sample of my data...
c.mRNA h.mRNA
1 8.224342 8.520142
2 9.096665 11.762597
3 10.698863 10.815275
4 10.666233 10.972130
5 12.043525 12.140297
I tried this...
pvals=apply(mRNA143.data,1,function(x) {t.test(mRNA143.data[,1],mRNA143.data[, 2])$p.value})
But I can tell from my plot that I'm off (the plots are in a straight line).

A reproducible example would go a long way. In preparing it, you might have realized that you are trying to subset columns using the row means as indices, which doesn't really make sense.
What you want to do is go through the rows of your data, subset the columns belonging to one group, do the same for the second group, and pass both to the t.test function.
This is how I would do it.
group1 <- matrix(rnorm(50, mean = 0, sd = 2), ncol = 5)
group2 <- matrix(rnorm(50, mean = 5, sd = 2), ncol = 5)
xy <- cbind(group1, group2)
# this is just a visualization of the test you're performing
plot(0, 0, xlim = c(-5, 11), ylim = c(0, 0.25), type = "n")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), add = TRUE)
out <- apply(xy, MARGIN = 1, FUN = function(x) {
# x is a vector, e.g. xy[i, ] or xy[1, ]
t.test(x = x[1:5], y = x[6:10])$p.value
})
out
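Applied to the layout described in the question (columns 4:9 for one group, 10:15 for the other, rows 1:143), the same idea might look like the sketch below. The data here are made up, and the three leading annotation columns are an assumption about what columns 1:3 hold. Selecting only the numeric columns before calling apply matters, because apply converts a data frame with mixed column types to a character matrix.

```r
set.seed(7)

# Made-up stand-in for c007: 3 annotation columns plus 12 expression columns.
c007 <- data.frame(
  id  = sprintf("gene%03d", 1:143),
  chr = "chr1",
  pos = 1:143,
  matrix(rnorm(143 * 12, mean = 10, sd = 1), nrow = 143)
)

# Keep only the numeric expression columns so apply() sees a numeric matrix.
expr <- as.matrix(c007[1:143, 4:15])

# Within each row, positions 1:6 are the first group and 7:12 the second.
pvals <- apply(expr, 1, function(x) t.test(x[1:6], x[7:12])$p.value)

hist(pvals, breaks = 20, main = "Row-wise t-test p-values", xlab = "p-value")
```

Note that a t-test needs the individual replicates in each row; once the groups have been collapsed to two row means, there is nothing left to test per row.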

Related

How to call multiple distribution functions from different vectors into a function in R

Let me talk you through my workflow:
General idea
Based on data in a dataframe, select the appropriate distribution functions, combine them in all possible ways to get the mean of the combined distributions.
Starting position
I have a large data frame df. It holds several variables (var1, var2 and var3 in this example), each of which contains the data needed to select the appropriate distribution function.
I have several distribution functions per variable:
var1_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 7, sd = 2))
Select the right distribution
Using an if_else on each of the vars I generate the appropriate distribution per case in a new vector. The if_else looks like this for var1 and has the same appearance for all vars:
df$distr_var1 <- if_else(df$info < 0, "var1_distr1",
if_else(df$info > 0 & df$info < 100, "var1_distr2", "var1_distr3"))
This results in the following df:
df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
distr_var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
Combine distribution functions
To combine distribution functions in a new proportional distribution function I have created this function based on this question:
foo <- function(...){
#set x values
x <- seq(1, 10, by = 1)
#create y values
y <- 1L
for (fun in list(...)) y <- y * fun(x)
#create new PDF
p <- data.frame(x,y)
pdqr::new_d(p, type = "continuous")
}
And I have stored the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
I would like to use the function foo on the df to generate proportional distributions for all combinations of the distributions given in the df. So, for each case, the following combinations: var1_var2, var1_var3, var2_var3, var1_var2_var3.
Calculate mean over distributions
If I want to calculate a mean over the distributions individually, I can do this:
means <- sapply(PDFS, pdqr::summ_mean)
df$mean_var1 <- means[df$distr_var1]
Or:
df$mean_var2 <- sapply(mget(df$distr_var2), pdqr::summ_mean)
Both approaches work fine. But on the combinations var1_var2, var1_var3, var2_var3, var1_var2_var3 I have not found a suitable approach, but tried these:
df$var1_var2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$distr_var1, df$distr_var2))), pdqr::summ_mean)
I tried to overcome not calling functions by using a list, but things seem to get too complicated / nested to work nicely...
Question
How do I select the appropriate distributions given in distr_var1, distr_var2 and distr_var3, combine them using foo and calculate the mean using pdqr::summ_mean?
I'm happy with all comments, also on the workflow in general
A foreach loop works for me:
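library(foreach)
library(pdqr)
df$var1_var2_mean <- foreach(i = 1:nrow(df), .combine = c) %do% {
# look up the two distribution functions named in row i of the df shown above
A <- get(as.character(df$distr_var1[i]))
B <- get(as.character(df$distr_var2[i]))
summ_mean(foo(A, B))
}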
And, for each combination I need to do this. At least I got it working...
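The row-by-row lookup can also be written without get() by indexing a named list with mapply. The sketch below is base R only: plain dnorm closures stand in for the pdqr d-functions, and product_mean is a hypothetical helper (not part of pdqr) that normalises the product of the densities on a fine grid and returns its mean. The same lookup pattern should carry over to the PDFS list and foo defined above, though that has not been tested here.

```r
# Plain density functions stand in for the pdqr d-functions (assumption).
PDFS <- list(
  var1_distr1 = function(x) dnorm(x, mean = 3, sd = 1),
  var1_distr2 = function(x) dnorm(x, mean = 6, sd = 1),
  var2_distr1 = function(x) dnorm(x, mean = 5, sd = 3),
  var2_distr2 = function(x) dnorm(x, mean = 3, sd = 1)
)

df <- data.frame(
  distr_var1 = c("var1_distr1", "var1_distr2", "var1_distr1"),
  distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean of the normalised product of densities on a grid.
product_mean <- function(...) {
  x <- seq(-20, 30, by = 0.01)
  y <- Reduce(`*`, lapply(list(...), function(f) f(x)))
  sum(x * y) / sum(y)
}

# Look the functions up by name, one row at a time.
df$var1_var2_mean <- mapply(
  function(a, b) product_mean(PDFS[[a]], PDFS[[b]]),
  df$distr_var1, df$distr_var2,
  USE.NAMES = FALSE
)
df
```

Adding a third argument to the mapply call (for example df$distr_var3) would cover the three-way combination in the same way.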

Add p-value column in qwraps2::summary_table

I want to make a little summary table for my colleagues in R-Markdown using qwraps2::summary_table. The data.frame contains information on different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
"Exposure 2" = ~ n_perc(exp_2 %in% 1),
"Exposure 3" = ~ n_perc(exp_3 %in% 1),
"Exposure 4" = ~ n_perc(exp_4 %in% 1))
)
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the differences between the control and intervention groups, e.g. with fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different ways but unfortunately failed.
Is there a convenient way to add individual columns to a qwraps2::summary_table, especially p-values for the grouped columns?
Thanks for your help!
Best,
Florian
[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p-value column. Maybe it is not the smartest solution, but it worked, at least for me.
First I calculated the p-values with a function that extracts the p-value from the output of fisher.test, and stored them in an object, in my case a simple numeric vector:
# write function to extract fishers.test
fisher.pvalue <- function(x) {
value <- fisher.test(x)$p.value
return(value)
}
# fisher test/generate pvalues
p.vals <- round(sapply(list(
table(df$exp_1, df$therapy),
table(df$exp_2, df$therapy),
table(df$exp_3, df$therapy),
table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
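With more exposure columns, listing each table by hand gets tedious. A minimal sketch of the same calculation driven by the column names follows; the data are simulated, and it assumes every exposure column starts with "exp_" as in the example above.

```r
set.seed(1)

# Simulated stand-in for the example data (two exposures, two therapy groups).
df <- data.frame(
  exp_1   = rbinom(200, size = 1, prob = 0.5),
  exp_2   = rbinom(200, size = 1, prob = 0.5),
  therapy = factor(sample(c("Control", "Intervention"), 200, replace = TRUE))
)

# Pick up the exposure columns by name instead of listing them one by one.
exposures <- grep("^exp_", names(df), value = TRUE)

# One Fisher test per exposure, each cross-tabulated against therapy.
p.vals <- vapply(
  exposures,
  function(v) fisher.test(table(df[[v]], df$therapy))$p.value,
  numeric(1)
)
round(p.vals, 2)
```

The result is a named numeric vector in the same order as the exposure columns, so it can be passed to cbind in the same way as the hand-built vector.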
In the following step I simply added an empty table column called P-Value and filled its cells with p.vals.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked for the corresponding cell indices in overall_table (for P-Values = 9:12) and filled them using base syntax. In the vignette of qwraps2 (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author used regular expressions to identify the right cells (see section 3.2).
If there are other methods to add individual columns to qwraps2::summary_table, I would appreciate seeing how it can be done.
Best,
Florian

generating quintiles and recoding multiple variables in R

I have 33 columns/variables with different values. What I am trying to do: generate quintiles for all variables (done), then use the quintiles to recode each variable (-2, -1, 0, 1, 2) by quintile. I generated the quintiles using:
q <- apply(ndataframe[2:34], 2, quantile, c(.2, .4, .6, .8, 1), na.rm = T)
Each variable is on a different scale, which is why the quintile values differ. I assume there is a better and more efficient way to recode by quintile than what I have been doing so far, which is using the quintile values and manually recoding each column one by one, e.g.:
n_df_quins$A_q <- recode(n_dataframe$A,
"0:1529 = '-2'; 1530:2199 = '-1'; 2200:2999 = '0'; 3000:3999 = '1'; 4000:25000 = '2'")
Thanks very much for any assistance anyone can offer.
You can use percent_rank to create a new data set holding the percentile rank of each observation in each column, then write a function that recodes those ranks according to your criteria and apply it to the whole data set in one go with mutate_all. Below is the code:
library("dplyr")
df<- data.frame(var1 = c(1:100), var2 = sample(1:1000, 100))
df1<- mutate_all(df, percent_rank)
recode_new<- function(x)
{
x = ifelse(x<=.2, -2, ifelse(x<=.4, -1, ifelse(x<=.6,0, ifelse(x<=.8,1,2))))
return(x)
}
df_final<- mutate_all(df1, recode_new)
Let me know if you have any questions.
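A base R alternative, sketched here with made-up data, is to cut each column at its own quintile breaks and shift the resulting bin number so it runs from -2 to 2. The helper name to_quintile_score is hypothetical. One caveat: cut() stops with an error if a column has so many ties that two quintile breaks coincide.

```r
# Hypothetical helper: map a numeric vector to -2..2 by its own quintiles.
to_quintile_score <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE)
  # cut() yields bins 1..5; subtracting 3 gives -2..2. NA values stay NA.
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE)) - 3L
}

set.seed(42)
df <- data.frame(var1 = 1:100, var2 = rnorm(100, mean = 50, sd = 10))

# Apply the same recoding to every column in one go.
scores <- as.data.frame(lapply(df, to_quintile_score))

table(scores$var1)
```

Because each column is cut at its own breaks, differences in scale between the variables do not matter.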

R: generate legend from dataframe variables

I am trying to generate a legend in R with reference to the following post.
I have the following MWE, which more or less represents what I'm working with. The objects a, b and c are generated over the course of an R script, along with the colours (there might be more, as the groups are generated by a loop).
a <- density(rnorm(100,mean = 5, sd = 1))
b <- density(rnorm(100,mean = 10, sd = 1))
c <- density(rnorm(100,mean = 7, sd = 1))
plot(c,col = "#FFCC00FF")
lines(b, col = "#FF6600FF")
lines(a, col = "#FF0000FF")
legendDataFrame <- data.frame(Group = c("A","B","C"), Colour = c("#FF0000FF","#FF6600FF", "#FFCC00FF"))
legend("topleft",legend=unique(legendDataFrame$Group), pch=1, col=unique(legendDataFrame$Colour))
print(legendDataFrame)
But I get an image like this, with incorrect colours. Any suggestions?
Try this:
legendDataFrame <- data.frame(stringsAsFactors=FALSE, Group = c("A","B","C"), Colour = c("#FF0000FF","#FF6600FF", "#FFCC00FF"))
P.S.
I smashed my head on data.frame(stringsAsFactors=TRUE) at least 1000 times. And I'm in good company:
http://r.789695.n4.nabble.com/stringsAsFactors-FALSE-td921891.html
http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
http://adv-r.had.co.nz/Data-structures.html
Instead of explicitly listing the colors, you can also try this if you want to maintain the dynamic functions:
legend("topleft",
legend=unique(legendDataFrame$Group),
pch=1,
col=as.vector(unique(legendDataFrame$Colour)))
It adds as.vector to convert the factor (unique(legendDataFrame$Colour)) into a character vector.
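The difference is easy to inspect without plotting. In the sketch below, the factor version of the column carries integer codes, and those codes (rather than the hex strings) are presumably what the col argument ends up using. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the problem mainly shows up on older versions or when the option is switched on explicitly.

```r
colours <- c("#FF0000FF", "#FF6600FF", "#FFCC00FF")

# Same data, stored once as a factor and once as plain character strings.
as_factor <- data.frame(Colour = colours, stringsAsFactors = TRUE)
as_text   <- data.frame(Colour = colours, stringsAsFactors = FALSE)

is.factor(as_factor$Colour)      # the column is a factor here
as.integer(as_factor$Colour)     # its underlying integer codes
as.character(as_factor$Colour)   # recovers the original hex strings
is.character(as_text$Colour)     # no conversion needed in this case
```

Either as.character or as.vector recovers the strings, so both work inside the legend call.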

R: t-test over all subsets over all columns

This is a follow up question from R: t-test over all columns
Suppose I have a huge data set, and then I created numerous subsets based on certain conditions. The subsets should have the same number of columns. Then I want to do t-test on two subsets at a time (outer loop) and then for each combination of subsets go through all columns one column at a time (inner loop).
Here is what I have come up with based on previous answer. This one stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )
First of all, you can define your dataset more easily using gl, and by avoiding creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
library(reshape)
longData <- melt(Data, id.vars = "C")
Now use pairwise.t.test to compute t tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
In general, pairwise t tests are inferior to a regression so be careful about their usage.
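For reference, here is a self-contained sketch of the steps above that runs without the reshape package. It swaps melt for base R's stack (so the long-format column names are chosen here, not produced by melt) and spells out the p-value adjustment method; "holm" is the default of pairwise.t.test.

```r
set.seed(123)

Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

# Base R stand-in for melt(): stack the three measurement columns.
stacked  <- stack(Data[c("X", "Y", "Z")])
longData <- data.frame(
  C        = rep(Data$C, times = 3),
  variable = stacked$ind,
  value    = stacked$values
)

# One group per C/variable combination, all pairs tested, p-values adjusted.
res <- with(
  longData,
  pairwise.t.test(value, interaction(C, variable), p.adjust.method = "holm")
)

dim(res$p.value)
```

With 10 levels of C and 3 variables there are 30 groups, so the p-value matrix has one row and one column fewer than that.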
You can use get(Data.Subsets[x]), which will pick out the relevant data frame. But I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (I didn't test it) return a list of data frames, each subsetted on one of the conditions you passed it.
However, it would be a much better idea, as @Richie Cotton points out, to reshape your data frame and use pairwise t tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to figure out whether you can use an ANOVA of some sort, as they are used for almost exactly this purpose.
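If individual t-tests per pair of subsets are still wanted, a base R sketch of the nested loop described in the question could look like this. It follows the prose of the question (two subsets at a time, then one column at a time), uses split instead of separate Data.c1, Data.c2, ... objects, and adds a Holm adjustment; the output column names are chosen here for illustration.

```r
set.seed(1)

Data <- data.frame(
  C = gl(5, 5, labels = paste0("c", 1:5)),
  X = rnorm(25, mean = 10, sd = 5),
  Y = rnorm(25, mean = 15, sd = 6),
  Z = rnorm(25, mean = 20, sd = 5)
)

measures <- c("X", "Y", "Z")

# One data frame per level of C, kept together in a named list.
dfs <- split(Data[measures], Data$C)

# Outer loop: every pair of subsets. Inner loop: every column.
pairs <- combn(names(dfs), 2, simplify = FALSE)
results <- do.call(rbind, lapply(pairs, function(p) {
  do.call(rbind, lapply(measures, function(col) {
    tt <- t.test(dfs[[p[1]]][[col]], dfs[[p[2]]][[col]])
    data.frame(
      subset1 = p[1],
      subset2 = p[2],
      column  = col,
      t.value = unname(tt$statistic),
      df      = unname(tt$parameter),
      p.value = tt$p.value
    )
  }))
}))

# Adjust for the number of tests that were run.
results$p.adjusted <- p.adjust(results$p.value, method = "holm")
head(results)
```

Five subsets give 10 pairs, and with three columns each that is 30 tests, which is why the adjustment step matters.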

Resources