Add p-value column in qwraps2::summary_table - r

I want to make a small summary table for my colleagues in R Markdown using qwraps2::summary_table. The data.frame contains information on different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
"Exposure 2" = ~ n_perc(exp_2 %in% 1),
"Exposure 3" = ~ n_perc(exp_3 %in% 1),
"Exposure 4" = ~ n_perc(exp_4 %in% 1))
)
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the differences between the control and intervention groups, e.g. with fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different approaches, but unfortunately none of them worked.
Is there a convenient way to add individual columns to a qwraps2::summary_table, especially p-values for the grouped columns?
Thanks for your help!
Best,
Florian

[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p-value column. Maybe it is not the smartest solution, but it worked, at least for me.
First I calculated the p-values with a function that extracts the p-value from the output of fisher.test, and stored them in an object, in my case a simple numeric vector:
# helper: run fisher.test on a contingency table and return only the p-value
fisher.pvalue <- function(x) {
value <- fisher.test(x)$p.value
return(value)
}
# fisher test/generate pvalues
p.vals <- round(sapply(list(
table(df$exp_1, df$therapy),
table(df$exp_2, df$therapy),
table(df$exp_3, df$therapy),
table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
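The tables above can also be built by looping over the column names, which avoids repeating the call for every exposure. A minimal sketch, assuming the exposure columns all start with exp_; the data frame toy is made up for illustration and only mimics the structure of df:

```r
set.seed(42)
# toy stand-in for df: binary exposures plus a two-level grouping factor
toy <- data.frame(
  exp_1 = rbinom(500, size = 1, prob = 0.5),
  exp_2 = rbinom(500, size = 1, prob = 0.5),
  exp_3 = rbinom(500, size = 1, prob = 0.5),
  therapy = factor(sample(c("Control", "Intervention"), 500, replace = TRUE))
)

# pick the exposure columns by name pattern (assumption: prefix "exp_")
exposures <- grep("^exp_", names(toy), value = TRUE)

# one Fisher test per exposure, keeping only the p-value
p.vals <- vapply(exposures, function(v) {
  fisher.test(table(toy[[v]], toy$therapy))$p.value
}, numeric(1))

round(p.vals, digits = 2)
```

The result is a named numeric vector, so the names can be used to check that each p-value ends up in the right row.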
In the following step I simply added an empty table column called P-Value and wrote the p.vals into its cells.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked up the corresponding cell indices in overall_table (9:12 for the P-Value column) and filled them using base syntax. In the qwraps2 vignette (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author uses regular expressions to identify the right cells (see section 3.2).
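The indices 9:12 work because a matrix in R is stored column by column, so in a table with 4 rows and 3 columns the third column occupies linear positions 9 to 12. The sketch below uses a plain character matrix as a stand-in for the summary table (the cell values are invented, and the real object additionally carries the qwraps2_summary_table class, which is not reproduced here). It suggests that indexing by column name reaches the same cells without hard-coding positions:

```r
# plain character matrix standing in for the 4 x 2 summary table
m <- matrix(c("10 (40%)", "12 (48%)", "9 (36%)", "14 (56%)",
              "11 (44%)", "13 (52%)", "8 (32%)", "15 (60%)"),
            nrow = 4,
            dimnames = list(paste("Exposure", 1:4),
                            c("Control", "Intervention")))
p <- c("0.78", "0.99", "0.41", "0.12")  # made-up p-values, already formatted

# fill the new column by linear position, as in the post
by_position <- cbind(m, "P-Value" = "")
by_position[9:12] <- p

# fill the new column by name, independent of the table's size
by_name <- cbind(m, "P-Value" = "")
by_name[, "P-Value"] <- p

identical(by_position, by_name)
```

Indexing by name keeps working if another exposure row or group column is added later, whereas 9:12 would then point at different cells.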
If there are other ways to add individual columns to a qwraps2::summary_table, I would appreciate seeing them.
Best,
Florian

Related

Sample draw in sapply without replacement

How does one draw a sample within a sapply function without replacement? Consider the following MWE below. What I am trying to achieve is for a number in idDRAW to receive a letter from chrSMPL (given the sample size of chrSMPL). Whether a number from idDRAW receives a letter is determined by the respective probabilities, risk factors and categories. This is calculated in the sapply function and stored in tmp.
The issue is sample replacement, leading to a number being named with a letter more than once. How can one avoid replacement whilst still using the sapply function? I have tried to adjust the code from this question (Alternative for sample) to suit my needs, but no luck. Thanks in advance.
set.seed(3)
chr<- LETTERS[1:8]
chrSMPL<- sample(chr, size = 30, replace = TRUE)
idDRAW<- sort(sample(1:100, size = 70, replace = FALSE))
p_mat<- matrix(runif(16, min = 0, max = 0.15), ncol = 2); rownames(p_mat) <- chr ## probability matrix
r_mat <- matrix(rep(c(0.8, 1.2), each = length(chr)), ncol = 2); rownames(r_mat) <- chr ## risk factor matrix
r_cat<- sample(1:2, 70, replace = TRUE) ## risk categories
# find number from `idDRAW` to be named a letter:
Out<- sapply(chrSMPL, function(x){
tmp<- p_mat[x, 1] * r_mat[x, r_cat]
sample(idDRAW, 1, prob = tmp)
})
> sort(Out)[1:3]
G B B
5 5 5
I managed with an alternative solution using a for loop as seen below. If anyone can offer suggestions on how the desired result can be achieved without using a for loop it would be greatly appreciated.
set.seed(3)
Out <- c()
for(i in 1:length(chrSMPL)){
tmp <- p_mat[chrSMPL[i], 1] * r_mat[chrSMPL[i], r_cat]
Out <- c(Out, sample(idDRAW, 1, prob = tmp))
rm <- which(idDRAW == Out[i])
idDRAW <- idDRAW[-rm]
r_cat <- r_cat[-rm]
}
names(Out) <- chrSMPL
sort(Out)[1:3]
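If the aim is only to avoid the explicit for loop, one option is to keep sapply and track which ids are still available in a logical vector that the function updates with <<-. This is a sketch rather than a faster algorithm: the draws remain sequential, because each one depends on the ids already taken. The name available is introduced here for illustration.

```r
set.seed(3)
chr <- LETTERS[1:8]
chrSMPL <- sample(chr, size = 30, replace = TRUE)
idDRAW <- sort(sample(1:100, size = 70, replace = FALSE))
p_mat <- matrix(runif(16, min = 0, max = 0.15), ncol = 2,
                dimnames = list(chr, NULL))          # probability matrix
r_mat <- matrix(rep(c(0.8, 1.2), each = length(chr)), ncol = 2,
                dimnames = list(chr, NULL))          # risk factor matrix
r_cat <- sample(1:2, 70, replace = TRUE)             # risk categories

# TRUE while an id has not been handed out yet
available <- rep(TRUE, length(idDRAW))

Out <- sapply(chrSMPL, function(x) {
  # ids that were already drawn get weight 0 and cannot be picked again
  w <- p_mat[x, 1] * r_mat[x, r_cat] * available
  pick <- sample(seq_along(idDRAW), 1, prob = w)
  available[pick] <<- FALSE
  idDRAW[pick]
})

anyDuplicated(Out)
```

Unlike the for loop, this leaves idDRAW and r_cat unchanged, so the remaining ids can be read off afterwards as idDRAW[available].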

T-Test For Genes using Apply Function in Dataframe

I’m trying to run a t.test on two data frames.
The data frames (which I carved out of a larger data.frame) have the data I need in rows 1:143. I’ve already created sub-variables, as I needed to calculate rowMeans.
> c.mRNA<-rowMeans(c007[1:143,(4:9)])
> h.mRNA<-rowMeans(c007[1:143,(10:15)])
I’m simply trying to run a t.test for each row, and then plot the p-values as histograms. This is what I thought would work…
Pvals<-apply(mRNA143.data,1,function(x) {t.test(x[c.mRNA],x[h.mRNA])$p.value})
But I keep getting an error?
Error in t.test.default(x[c.mRNA], x[h.mRNA]) :
not enough 'x' observations
I’ve got something off in my syntax and cannot figure it out for the life of me!
EDIT: I've created a data.frame so it's now just two columns, I need a p-value for each row. Below is a sample of my data...
c.mRNA h.mRNA
1 8.224342 8.520142
2 9.096665 11.762597
3 10.698863 10.815275
4 10.666233 10.972130
5 12.043525 12.140297
I tried this...
pvals=apply(mRNA143.data,1,function(x) {t.test(mRNA143.data[,1],mRNA143.data[, 2])$p.value})
But I can tell from my plot that I'm off (the plots are in a straight line).
A reproducible example would go a long way. In preparing it, you might have realized that you are trying to subset columns based on mean, which doesn't make sense, really.
What you want to do is go through rows of your data, subset columns belonging to a certain group, repeat for the second group and pass that to t.test function.
This is how I would do it.
group1 <- matrix(rnorm(50, mean = 0, sd = 2), ncol = 5)
group2 <- matrix(rnorm(50, mean = 5, sd = 2), ncol = 5)
xy <- cbind(group1, group2)
# this is just a visualization of the test you're performing
plot(0, 0, xlim = c(-5, 11), ylim = c(0, 0.25), type = "n")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), add = TRUE)
out <- apply(xy, MARGIN = 1, FUN = function(x) {
# x is a vector, e.g. xy[i, ] or xy[1, ]
t.test(x = x[1:5], y = x[6:10])$p.value
})
out
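Applied to the layout in the question, the same idea would use the replicate columns themselves (4:9 and 10:15 of c007) rather than their row means, since a t-test needs several observations per group. A sketch on made-up data, because c007 is not available; the names expr, ctrl_cols and trt_cols are hypothetical:

```r
set.seed(1)
# made-up expression data: 143 genes (rows) and 15 columns, as in c007
expr <- matrix(rnorm(143 * 15, mean = 10, sd = 1), nrow = 143)
ctrl_cols <- 4:9     # columns that went into c.mRNA
trt_cols  <- 10:15   # columns that went into h.mRNA

# one t-test per row, comparing the two groups of replicate columns
pvals <- apply(expr, MARGIN = 1, FUN = function(x) {
  t.test(x = x[ctrl_cols], y = x[trt_cols])$p.value
})

length(pvals)
```

Calling hist(pvals) afterwards would draw the histogram of p-values the question asks for.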

R: How to pick rows of a data frame that match a criteria but not filter out others

Assume a data frame that look something like this:
set.seed(42)
seqs <- sapply(1:20, FUN = function(x) { paste(sample(letters, size = 11, replace = T), collapse = "") })
annot1 <- sapply(1:1000, FUN = function(x) { sample(c("A", "B","C"), size = 1, replace = T)})
annot2 <- sapply(1:1000, FUN = function(x) { sample(c("X", "Y","Z"), size = 1, replace = T)})
values <- rnorm(n = length(annot1), mean = 1, sd = 1)
df <- data.frame(id=sample(seqs, size = length(annot1), replace = T), annot1, annot2, values)
I would like to get the rows that have a value above a given threshold, e.g. value > 1.5 in either 1 or all 3 conditions (but not 2), denoted by variables annot1 or annot2. For the ids that match these criteria, I want all values (not only the ones above the threshold).
My usual approach, which consists of chaining filter() and n_distinct(), doesn't work in this case since it filters out observations where the value isn't above the threshold, which creates issues when I go to wide format later on to do clustering on these variables.
I have thought about creating intermediate variables and use them to pick up ids but it feels like there must be a more elegant solution.
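One possible reading of the requirement: for each id, count in how many annot1 levels at least one value exceeds the threshold, keep the ids where that count is 1 or 3, and return every row for those ids. A base R sketch of that reading on a small made-up data frame (the interpretation, the toy values and the names toy, hits, n_cond and keep_ids are assumptions, not part of the question):

```r
# toy data: three ids, each measured once per condition
toy <- data.frame(
  id     = rep(c("a", "b", "c"), each = 3),
  annot1 = rep(c("A", "B", "C"), times = 3),
  values = c(2.0, 0.1, 0.2,   # id a: above threshold in 1 condition
             2.0, 1.8, 0.3,   # id b: above threshold in 2 conditions
             1.6, 1.7, 1.9)   # id c: above threshold in 3 conditions
)
threshold <- 1.5

# id x condition matrix: does any value in that cell exceed the threshold?
hits <- tapply(toy$values > threshold, list(toy$id, toy$annot1), any)

# number of conditions with a hit per id (empty cells count as no hit)
n_cond <- rowSums(hits, na.rm = TRUE)
keep_ids <- names(n_cond)[n_cond %in% c(1, 3)]

# all rows of the selected ids, including values below the threshold
result <- toy[toy$id %in% keep_ids, ]
result
```

The selection happens on ids rather than on rows, which is why values below the threshold survive for the ids that qualify.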

Fill elements of a list without looping

I am trying not to use a for loop to assign values to the elements of a list.
Here, I create an empty list, give it a length of 20 and name each of the 20 elements.
mylist <- list()
length(mylist) <- 20
names(mylist) <- paste0("element", 1:20)
I want each element of mylist to contain samples drawn from a pool of randomly generated numbers denoted as x:
x <- runif(100, 0, 1)
I tried the following lines of code, none of which gives the desired result:
mylist[[]] <- sample(x = x, size = 20, replace = TRUE) # Gives an error
mylist[[1:length(mylist)]] <- sample(x = x, size = 20, replace = TRUE) # Does not give the desired result
mylist[1:length(mylist)] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
mylist[] <- sample(x = x, size = 20, replace = TRUE) # Gives the same undesired result as the previous line of code
P.S. As explained above, the desired result is a list of 20 elements, which individually contains 20 numeric values. I can do it using a for loop, but I would like to become a better R user and use vectorized operations as much as possible.
Thank you for your help.
Maybe replicate is what you're looking for.
mylist <- replicate(20, sample(x = x, size = 20, replace = TRUE), simplify=FALSE)
names(mylist) <- paste0("element", 1:20)
Note that there is no need to first create a list, replicate will do it for you.
Since you're using replace=TRUE you could also generate all 400 at once and then split them up. If you were doing this many times, this would probably be faster than replicate. For only 20 draws, the speed difference will hardly matter, and the code using replicate is perhaps easier to read and understand, so it might be preferred for that reason.
foo <- sample(x = x, size = 20*20, replace = TRUE)
mylist <- split(foo, rep(1:20, each=20))
Alternatively, you could split them by converting to a data frame first. Not sure which would be faster.
mylist <- as.list(as.data.frame(matrix(foo, ncol=20)))
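Naming can also be folded into the same expression with setNames, and lengths gives a quick check that the result has the desired shape. A small sketch along the lines of the replicate approach:

```r
set.seed(1)
x <- runif(100, 0, 1)

# build the list and attach the element names in one expression
mylist <- setNames(
  replicate(20, sample(x = x, size = 20, replace = TRUE), simplify = FALSE),
  paste0("element", 1:20)
)

# every element should hold 20 values drawn from x
all(lengths(mylist) == 20)
```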

R: t-test over all subsets over all columns

This is a follow-up question to R: t-test over all columns
Suppose I have a huge data set, and then I created numerous subsets based on certain conditions. The subsets should have the same number of columns. Then I want to do t-test on two subsets at a time (outer loop) and then for each combination of subsets go through all columns one column at a time (inner loop).
Here is what I have come up with based on previous answer. This one stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )
First of all, you can define your dataset more easily using gl, and by avoiding creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
longData <- melt(Data, id.vars = "C")
Now use pairwise.t.test to compute t tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
In general, pairwise t tests are inferior to a regression so be careful about their usage.
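The reshape-and-test steps above can be run end to end with base R only. The sketch below swaps melt for stack(), so no extra package is needed and the long-format columns are called values and ind rather than value and variable. It also spells out the p-value adjustment; "holm" is the default of pairwise.t.test:

```r
set.seed(1)
Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

# long format: stack() returns the columns `values` and `ind`,
# with the X block first, then Y, then Z, so C is repeated three times
longData <- data.frame(C = rep(Data$C, times = 3),
                       stack(Data[c("X", "Y", "Z")]))

# all pairwise comparisons between the 30 C-by-variable groups
res <- with(longData,
            pairwise.t.test(values, interaction(C, ind),
                            p.adjust.method = "holm"))

dim(res$p.value)
```

The p-values come back as a lower-triangular matrix with one row and one column fewer than the number of groups.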
You can use get(Data.Subsets[x]), which will pick out the relevant data frame. But I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create them using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (didn't test it) return a list of data frames, each subsetted on one of the conditions you passed it.
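A quick check of the lapply idea on simulated data shaped like the example in the question, together with split() as a one-call alternative. The name dfs_split is introduced here for illustration:

```r
set.seed(1)
Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(50, mean = 10, sd = 5),
  Y = rnorm(50, mean = 15, sd = 6),
  Z = rnorm(50, mean = 20, sd = 5)
)
conditions <- c("c1", "c2", "c3", "c4", "c5")

# one data frame per condition, keeping only the measurement columns
dfs <- lapply(conditions, function(x) subset(Data, C == x, select = X:Z))
names(dfs) <- conditions

# split() builds one data frame per level of C in a single call;
# indexing by name keeps only the conditions of interest
dfs_split <- split(Data[c("X", "Y", "Z")], Data$C)[conditions]

# rows and columns of each subset
sapply(dfs, dim)
```

Either list can then be indexed by condition name, e.g. dfs[["c3"]], instead of going through get() and a vector of object names.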
However, it would be a much better idea, as @Richie Cotton points out, to reshape your data frame and use pairwise t tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to try to figure out whether you can use an ANOVA of some sort, as they are used for almost exactly this purpose.
