R: t-test over all subsets over all columns - r

This is a follow-up question to R: t-test over all columns
Suppose I have a huge data set from which I created numerous subsets based on certain conditions. The subsets all have the same number of columns. I want to run a t-test on two subsets at a time (outer loop) and, for each combination of subsets, go through all columns one column at a time (inner loop).
Here is what I have come up with based on the previous answer. It stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )

First of all, you can define your dataset more easily by using gl and by avoiding creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
library(reshape) # provides melt()
longData <- melt(Data, id.vars = "C")
Now use pairwise.t.test to compute t tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
In general, pairwise t tests are inferior to a regression so be careful about their usage.
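The answer above mentions that base R can do the reshaping too. As a minimal sketch of that route (an assumption on my part, not the answerer's code), stack() can replace melt, and the p value adjustment can be named explicitly through the p.adjust.method argument of pairwise.t.test:

```r
set.seed(42)
Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(n = 50, mean = 10, sd = 5),
  Y = rnorm(n = 50, mean = 15, sd = 6),
  Z = rnorm(n = 50, mean = 20, sd = 5)
)

# stack() returns two columns: 'values' (the numbers) and 'ind' (the source column)
long <- data.frame(C = rep(Data$C, times = 3), stack(Data[c("X", "Y", "Z")]))

# One group per (level of C, column) pair; Holm-adjusted p values
res <- with(long, pairwise.t.test(values, interaction(C, ind),
                                  p.adjust.method = "holm"))
dim(res$p.value)
```

With 10 levels of C and 3 columns there are 30 groups, so the returned p value matrix compares every group against every other one.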

You can use get(Data.Subsets[x]), which will pick out the relevant data frame, but I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (I didn't test it) return a list of data frames, each subsetted on one of the conditions you passed it.
However, it would be a much better idea, as @Richie Cotton points out, to reshape your data frame and use pairwise t tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to figure out whether you can use an ANOVA of some sort, as they are used for almost exactly this purpose.
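For readers who still want the loop structure described in the question (pairs of subsets in the outer loop, columns in the inner loop), here is a rough base R sketch. It keeps the subsets in a named list built with split() instead of looking them up by name, and adjusts the p values with p.adjust. The names subsets, subset_pairs and results are illustrative, not from the original posts.

```r
set.seed(1)
Data <- data.frame(
  C = gl(10, 5, labels = paste0("c", 1:10)),
  X = rnorm(50, mean = 10, sd = 5),
  Y = rnorm(50, mean = 15, sd = 6),
  Z = rnorm(50, mean = 20, sd = 5)
)

# One data frame per level of C (columns X:Z only); keep the first five
subsets <- split(Data[c("X", "Y", "Z")], Data$C)[paste0("c", 1:5)]

# Outer loop: every pair of subsets; inner loop: every column
subset_pairs <- combn(names(subsets), 2, simplify = FALSE)
results <- do.call(rbind, lapply(subset_pairs, function(p) {
  do.call(rbind, lapply(c("X", "Y", "Z"), function(col) {
    test <- t.test(subsets[[p[1]]][[col]], subsets[[p[2]]][[col]])
    data.frame(Subset1 = p[1], Subset2 = p[2], Column = col,
               t.value = unname(test$statistic),
               df = unname(test$parameter),
               p.value = test$p.value)
  }))
}))

# Adjust for the number of tests that were run
results$p.adjusted <- p.adjust(results$p.value, method = "holm")
head(results)
```

Five subsets give 10 pairs, and three columns per pair give 30 Welch t-tests in total, which is exactly why the adjustment (or an ANOVA) matters.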

How to call multiple distribution functions from different vectors into a function in R

Let me talk you through my workflow:
General idea
Based on data in a dataframe, select the appropriate distribution functions, combine them in all possible ways to get the mean of the combined distributions.
Starting position
I have a large data frame df. In there I have different variables var1, var2 and var3 in this example which contains data to select the appropriate distribution function.
I have several distribution functions per variable:
var1_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 7, sd = 2))
Select the right distribution
Using an if_else on each of the vars I generate the appropriate distribution per case in a new vector. The if_else looks like this for var1 and has the same appearance for all vars:
df$distr_var1 <- if_else(df$info < 0, "var1_distr1",
if_else(df$info > 0 & df$info < 100, "var1_distr2", "var1_distr3"))
This results in the following df:
df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
distr_var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
Combine distribution functions
To combine distribution functions in a new proportional distribution function I have created this function based on this question:
foo <- function(...){
#set x values
x <- seq(1, 10, by = 1)
#create y values
y <- 1L
for (fun in list(...)) y <- y * fun(x)
#create new PDF
p <- data.frame(x,y)
pdqr::new_d(p, type = "continuous")
}
And I have stored the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
I would like to use the function foo on the df to generate proportional distributions for all combinations of the distributions given in the df. So, for each case, I need the following combinations: var1_var2, var1_var3, var2_var3, var1_var2_var3.
Calculate mean over distributions
If I want to calculate a mean over the distributions individually, I can do this:
means <- sapply(PDFS, pdqr::summ_mean)
df$mean_var1 <- means[df$distr_var1]
Or:
df$mean_var2 <- sapply(mget(df$distr_var2), pdqr::summ_mean)
Both approaches work fine. But on the combinations var1_var2, var1_var3, var2_var3, var1_var2_var3 I have not found a suitable approach, but tried these:
df$var1_var2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$distr_var1, df$distr_var2))), pdqr::summ_mean)
I tried to overcome not calling functions by using a list, but things seem to get too complicated / nested to work nicely...
Question
How can I select the appropriate distributions given in distr_var1, distr_var2 and distr_var3, combine them using foo and calculate the mean using pdqr::summ_mean?
I'm happy with all comments, also on the workflow in general.
A foreach loop works for me:
library(foreach)
df$var1_var2_mean <- foreach(i = 1:nrow(df), .combine = c) %do% {
A <- as.name(df$distr_var1[i])
B <- as.name(df$distr_var2[i])
pdqr::summ_mean(foo(get(A), get(B)))
}
I need to do this for each combination. At least I got it working...
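The lookup-by-name step can also be done without get() by indexing a named list, as with PDFS in the question. The sketch below shows only that pattern and deliberately swaps pdqr for plain base R: the densities are ordinary functions, and combined_mean (a hypothetical helper, not part of pdqr) multiplies them on a grid and takes the weighted mean. With pdqr installed, the body of the mapply call would presumably be pdqr::summ_mean(foo(PDFS[[a]], PDFS[[b]])) instead.

```r
# Plain density functions stored by name (stand-ins for the pdqr objects)
pdfs <- list(
  var1_distr1 = function(x) dnorm(x, mean = 3, sd = 1),
  var1_distr2 = function(x) dnorm(x, mean = 6, sd = 1),
  var2_distr1 = function(x) dnorm(x, mean = 5, sd = 3),
  var2_distr2 = function(x) dnorm(x, mean = 3, sd = 1)
)

choices <- data.frame(
  distr_var1 = c("var1_distr1", "var1_distr2"),
  distr_var2 = c("var2_distr2", "var2_distr1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: multiply the densities on a grid, then take the mean
# of the (renormalised) product
combined_mean <- function(..., grid = seq(-20, 30, by = 0.01)) {
  dens <- Reduce(`*`, lapply(list(...), function(f) f(grid)))
  sum(grid * dens) / sum(dens)
}

# Row-wise: look both functions up by name and combine them
choices$var1_var2_mean <- mapply(
  function(a, b) combined_mean(pdfs[[a]], pdfs[[b]]),
  choices$distr_var1, choices$distr_var2,
  USE.NAMES = FALSE
)
choices
```

The same mapply call extends to three columns by adding a third argument, which avoids writing one loop per combination.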

Add p-value column in qwraps2::summary_table

I want to make a little summary table for my colleagues in R-Markdown using qwraps::summary_table. The data.frame contains information of different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
"Exposure 2" = ~ n_perc(exp_2 %in% 1),
"Exposure 3" = ~ n_perc(exp_3 %in% 1),
"Exposure 4" = ~ n_perc(exp_4 %in% 1))
)
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the group differences between the control and intervention groups, e.g. with fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different ways but unfortunately failed.
Is there a convenient way to add individual columns via qwraps2::summary_table, especially p-values for the grouped columns?
Thanks for your help!
Best,
Florian
[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p.values column. Maybe it is not the smartest solution, but worked, at least for me.
First I calculated the p.values with a function, which extracts the p.values from the returned output of fisher.test and stored them in an object, in my case a simple numeric vector:
# write function to extract fishers.test
fisher.pvalue <- function(x) {
value <- fisher.test(x)$p.value
return(value)
}
# fisher test/generate pvalues
p.vals <- round(sapply(list(
table(df$exp_1, df$therapy),
table(df$exp_2, df$therapy),
table(df$exp_3, df$therapy),
table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
In the following step I simply added an empty table column called P-Values and added the p.vals to the column cells.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked for the corresponding cell indices in overall_table (for P-Values = 9:12) and filled them using base syntax. In the vignette of qwraps2 (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author used regular expressions to identify the right cells (see section 3.2).
If there are other methods to add individual columns to qwraps2::summary_table, I would appreciate seeing how it can be done.
Best,
Florian
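The p-value step above lists each table by hand. As a small sketch of the same idea in base R only (no qwraps2 involved, and with freshly simulated data rather than the data from the question), the exposure columns can be picked up by name so that adding a fifth exposure needs no code change:

```r
set.seed(1)
df <- data.frame(
  therapy = factor(sample(c("Control", "Intervention"), 500, replace = TRUE)),
  exp_1 = rbinom(500, size = 1, prob = 0.5),
  exp_2 = rbinom(500, size = 1, prob = 0.5),
  exp_3 = rbinom(500, size = 1, prob = 0.5),
  exp_4 = rbinom(500, size = 1, prob = 0.5)
)

# All columns whose name starts with "exp_"
exposures <- grep("^exp_", names(df), value = TRUE)

# One Fisher exact test per exposure against the grouping variable
p.vals <- sapply(exposures, function(v) {
  fisher.test(table(df[[v]], df$therapy))$p.value
})
round(p.vals, digits = 2)
```

The resulting named vector is in the same order as the exposure rows, so it can be passed to the cbind call shown above.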

R Loop: Perform a Function for Every 3 Rows

I have 2000 wheat plants, growing over the course of 40 days.
I'd like to apply the coef function to each plant to find the coefficients (a, b, and c) of the quadratic equation through its 3 time points.
(1) The coef(lm(y~poly(x,2,raw=TRUE))) function works exactly the way I want it to.
(2) However, the way my data is presented, requires me to manually set x and y.
(3) Thus, I melted my data, and ordered it.
(4) I'd like to make a loop that will take the first three in column "Day" and set that as x. Then I'd like it to take the first three in column "Height" and set that as y.
Then I'd like to perform the coef function.
Last I'd like it to present the coefficient outputs I need, preferably in a new data table.
Then repeat for every three rows, which represent each wheat ID, for all wheat plants.
1) This function works, giving me coefficients: a, b, c
x<-c(1,2,3)
y<-c(1,10,4)
coef(lm(y~poly(x,2,raw=TRUE)))
2) This is what my data originally looked like
A = matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6),nrow=3, ncol=3)
colnames(A)<-c("10", "25", "40")
rownames(A)<-c("Wheat 1", "Wheat 2", "Wheat 3")
A
3) This is my melted format
library(reshape2) # provides melt() for matrices
A.melted<-as.data.frame(melt(A, id.vars="ID"))
A.melted<-A.melted[with(A.melted,order(Var1)),]
colnames(A.melted) <- c("WheatID", "Day", "Height")
A.melted$Day<-as.numeric(as.character(A.melted$Day))
A.melted
#
4) This is what I am trying to do with my loop....
for every 3 rows,
x<-A.melted[,2]
y<-A.melted[,3]
coef(lm(y~poly(x,2,raw=TRUE)))
something to compile the coefficients: a, b, c
I am just not familiar with the syntax of loops, and I'd love any tips and suggestions. Perusing Google tells me that one should not use loops unless absolutely required, since I may run into more problems, so I am open to non-loop techniques as well.
If you want to do it in a loop try this. The crucial part is to use seq together with a by = argument to let the index take the steps you need.
library(tibble)
df <- tibble(
WheatID = rep(NA_character_, nrow(A)),
Intercept = rep(NA_real_, nrow(A)),
poly1 = rep(NA_real_, nrow(A)),
poly2 = rep(NA_real_, nrow(A))
)
cnt <- 1
for (i in seq(1, nrow(A.melted), by = 3)) {
x <- A.melted$Day[i + 0:2]
y <- A.melted$Height[i + 0:2]
df$WheatID[cnt] <- as.character(A.melted$WheatID[i])
df[cnt, 2:4] <- as.list(unname(coef(lm(y~poly(x,2,raw=TRUE))))) # as.list puts one coefficient in each column
cnt <- cnt + 1
}
df
Note: I am not a data.table guy. Therefore, I present you with a tibble.
We can do this with the help of data.table, see ?data.table:
library(data.table)
setDT(A.melted) # A.melted is a data.frame; convert it to a data.table by reference
A.models = A.melted[, .(model = list(lm(Height ~ poly(Day, 2), data = .SD))),
by = WheatID]
A.models[, coefs := list(.(coefficients(model[[1]]))),
by = WheatID]
You can access each model like this:
A.models[WheatID == "Wheat 1", model[[1]]]
and even
A.models[WheatID == "Wheat 1", summary(model[[1]])]
The magic here happens because data.table takes in J expressions, not only functions.
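This is something you can also do with the data.table package, by splitting the rows into groups of three.
library(data.table)
data.list <- split(A.melted, f = (1:nrow(A.melted) - 1) %/% 3)
coefs <- lapply(data.list, function(x) {
# fit Height as a quadratic in Day for one plant (three rows)
fit <- coef(lm(Height ~ poly(Day, 2, raw = TRUE), data = x))
data.table(
WheatID = as.character(x$WheatID[1]),
intercept = fit[1],
poly1 = fit[2],
poly2 = fit[3]
)
})
coefs <- rbindlist(coefs)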
Or you could perform apply() directly on the original matrix:
x <- as.numeric(colnames(A))
apply(A, 1, function(y) coef(lm(y~poly(x,2,raw=TRUE))))
Wheat 1 Wheat 2 Wheat 3
(Intercept) -3.88888889 -0.555555556 6.666667e-01
poly(x, 2, raw = TRUE)1 1.11111111 0.477777778 1.333333e-01
poly(x, 2, raw = TRUE)2 -0.02222222 -0.002222222 -2.417315e-18
Or you could transpose the data and use the coef(...) call directly:
x <- as.numeric(colnames(A))
coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))
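Since the question asks for the coefficients a, b and c in a new table, here is a short sketch that takes the multi-response lm call above and reshapes its coefficient matrix into one row per plant. The name coef_table and the column names a, b, c are illustrative; a is the quadratic term, b the linear term and c the intercept.

```r
A <- matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6), nrow = 3, ncol = 3)
colnames(A) <- c("10", "25", "40")
rownames(A) <- c("Wheat 1", "Wheat 2", "Wheat 3")

x <- as.numeric(colnames(A))

# One column of t(A) per plant, so lm fits all plants in a single call;
# the result is a 3 x 3 matrix with one column of coefficients per plant
fit <- coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))

coef_table <- data.frame(
  WheatID = colnames(fit),
  a = fit[3, ],  # coefficient of x^2
  b = fit[2, ],  # coefficient of x
  c = fit[1, ],  # intercept
  row.names = NULL
)
coef_table
```

Because each plant has exactly three time points, the quadratic passes through all of them, so these are interpolation coefficients rather than a fit with residuals.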

Create a matrix from a list consisting of unequal matrices for individual bootstraps

I tried to create a matrix from a list which consists of N unequal matrices...
The reason to do this is to make R individual bootstrap samples.
In the example below you can find e.g. 2 companies, where we have 1 with 10 & 1 with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create data frame of input variables which helps
# to conduct the rowise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE);
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = T)));
---------- UNLISTING ---------
Currently, I am trying to do it, incorrectly, like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this & if possible reasonably efficient?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths, and as a result you would like a bootstrap sample for each company of the same size as its original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
Stop here if you would like separate bootstrap samples, e.g. in case you want to estimate separately. Otherwise, bind them back together:
bootdata = do.call(rbind, L2)
Best wishes,
Tim
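To get from this to the R bootstrap samples as matrices that the question asks for, one possible sketch is below. It resamples row indices within each company and converts the result with as.matrix. The helper boot_once and the simulated columns are illustrative assumptions, not code from either post.

```r
set.seed(7)
Time <- c(10, 5)  # panel lengths of the two companies
data <- data.frame(y = rnorm(15), x1 = rnorm(15), x2 = rnorm(15))
company <- rep(seq_along(Time), times = Time)

# Hypothetical helper: resample rows with replacement inside each company,
# keeping every panel at its original length, and return a numeric matrix
boot_once <- function(d, g) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), g), function(i) {
    i[sample.int(length(i), replace = TRUE)]
  }))
  as.matrix(d[idx, ])
}

# R = 5 bootstrap samples, each a 15 x 3 matrix
bootList <- replicate(5, boot_once(data, company), simplify = FALSE)
dim(bootList[[1]])
```

Sampling indices rather than data frames avoids the unlist step from the question, and sample.int is used so that a company with a single observation is still handled correctly.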

How to extract the coefficients of various n-th degree polynomial models and store them into a 1D array?

I have a collection of sixth-degree polynomial regression models from which I want to gather only the coefficients.
I have a large dataset that contains 3 columns: the first one is an arbitrary parameter that acts as a flag, the second is the input, and third is the output.
I subsetted my dataset according to my parameters, so I have 10 smaller datasets. My models arose from these subsets.
As an example:
#-----"Dummy" Dataset-----
a = seq(1:100) #act as input
b = a + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = as.data.frame(cbind(a,b))
colnames(df) = c("input", "output")
#
#-----Subsets-----
df_1_XlessThen50 = subset(df, input < 50) #example of subsetting. In this
#case I used the input values themselves as threshold
#for subsetting just for simplicity.
#In reality, I use the first column of my dataframe(parameter).
df_2_XmoreThen50 = subset(df, input >= 50) #second subset. In other words,
#for every parameter, I will divide that subset
#into two smaller ones.
#
#-----Models-----
model_3_ab.1 = lm(output ~ poly(input, 6, raw = T), data = df_1_XlessThen50)
model_3_ab.2 = lm(output ~ poly(input, 6, raw = T), data = df_2_XmoreThen50)
My models' names follow a pattern: "model" + parameter + "_ab." + id number.
I should clarify that the "id number" indicates which of the two models for every parameter I will consider. (These smaller datasets within every parameter are the results of subsetting according to a pre-determined threshold.)
What I have now is a collection of models like these two above for every parameter in my dataset. I have 10 parameters, hence, 20 models.
I want to gather only the coefficients of every model and store them into a matrix or dataframe. To achieve that, I tried:
parameter = c(2,4,6,7,9,11,33,35,37,50)
myData = array()
for (i in parameter){ #Loop over all parameters
for (j in 1:2){ #Loop over the pair of models for each parameter
for ( k in 1:6){ #Loop over my model's coefficient
aux = paste("model",i,"ab.",j, sep = "")
aux = get(aux)
myData[i,j,k] = aux$coefficients[k]
}
}
}
However, I keep getting the same error:
Error in myData[i, j, k] = aux$coefficients[k] :
incorrect number of subscripts
With this error, I can't advance toward my goal, which is to write a .txt with one single column formatted as such:
A(2,1,1) = first order coefficient for the first model related to parameter 2
B(2,2,1) = second order coefficient for the first model related to parameter 2
C(2,3,1)
...
G(2,7,1)
A(2,1,2)
where in (M, N, O): M is the parameter, N is the coefficient of the N-th degree (N = 7 is the intercept), and O is either 1 or 2, respectively, the first or second model in each pair of models for every parameter.
It'd be nice to get help/guidance for the whole problem, but I'll already be grateful if I can get past the part where I want to store my coefficients in a matrix using for-loops. Thanks
Here is what I mean:
set.seed(42)
a1 = seq(1:100) #act as input
a2 <- runif(100)
b = a1 + a2 + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = data.frame(input1 = a1,
input2 = a2,
output = b)
df$flag <- a1 <= 50
library(reshape2)
df <- melt(df, id.vars = c("output", "flag"))
library(lme4)
df$flag_par <- interaction(df$flag, df$variable)
fits <- lmList(output ~ poly(value, 2, raw = TRUE) | flag_par, data = df)
coef(fits)
# (Intercept) poly(value, 2, raw = TRUE)1 poly(value, 2, raw = TRUE)2
#FALSE.input1 125.957730 -2.434849 0.022137337
#TRUE.input1 2.842223 1.216113 -0.006686362
#FALSE.input2 68.807752 -7.429319 26.486493218
#TRUE.input2 31.791633 -18.595105 16.608600876
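On the error itself: array() called without a dim argument creates a one-dimensional object, so indexing it with three subscripts fails. In addition, the loop indexes by the parameter value (2, 4, ..., 50) rather than by position. The sketch below shows one way around both, under simplifying assumptions: two made-up parameters, degree 2 instead of 6, and a threshold at input 50. The array is laid out as (parameter, coefficient, model), and here the coefficient dimension starts with the intercept (degree 0).

```r
set.seed(42)
params <- c(2, 4)  # illustrative parameter values
df <- data.frame(parameter = rep(params, each = 100),
                 input = rep(1:100, times = 2))
df$output <- df$input + rnorm(nrow(df), mean = 0, sd = 20)

degree <- 2  # the question uses 6

# Pre-dimension the array; dimnames carry the real parameter values
coef_array <- array(
  NA_real_,
  dim = c(length(params), degree + 1, 2),
  dimnames = list(parameter = params, coefficient = 0:degree, model = 1:2)
)

for (p in seq_along(params)) {   # loop over positions, not parameter values
  for (m in 1:2) {               # model 1: input < 50, model 2: input >= 50
    in_half <- if (m == 1) df$input < 50 else df$input >= 50
    d <- df[df$parameter == params[p] & in_half, ]
    coef_array[p, , m] <- coef(lm(output ~ poly(input, degree, raw = TRUE),
                                  data = d))
  }
}

# Coefficients of the second model for parameter 4, looked up by name
coef_array["4", , "2"]
```

Fitting inside the loop also removes the need to build model names with paste and fetch them with get.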
