Performing a large number of 2-sample t-tests in R

I am writing a function that takes a data.frame and returns a data.frame of p-values, one for each variable tested.
# data and labels
my_data <- data.frame(matrix(data = rnorm(100 * 10000), nrow = 100, ncol = 10000)) # 100 * 10000 draws, so no values are recycled
labels <- sample(0:1, 100, replace = TRUE)
# append the labels to the data, then filter
my_data$labels <- labels
sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)
#perform a t-test on each column
p_vals <- data.frame()
for(i in c(1:10000)) {
p_vals <- rbind(p_vals, t.test(x = sample_1[,i], y = sample_2[,i])$p.value)
}
p_vals # inside the function, this is the value passed to return()
This works, but I hope there is a more efficient way to do it without the for loop. The p-values should be in rows, because later functions need to keep track of which variable had which p-value.

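Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to run the tests. Drop the appended labels column first; otherwise the last test compares labels with itself and fails:
p_vals <- sapply(my_data[names(my_data) != "labels"], function(x) t.test(x ~ labels)$p.value)
This returns a named vector of p-values in the same order as the columns of my_data, so each p-value stays attached to its variable name.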
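Since the question asks for one row per variable, the named vector can be turned into a two-column data frame. The following is a minimal sketch, not the answerer's code: it assumes a smaller data set than the question's (50 variables), keeps labels as a separate vector rather than a column of my_data, and the names n_obs, n_vars and p_table are made up for illustration.

```r
set.seed(123)

# Smaller stand-in for the question's data: 100 observations, 50 variables
n_obs  <- 100
n_vars <- 50
my_data <- data.frame(matrix(rnorm(n_obs * n_vars), nrow = n_obs))

# Balanced 0/1 group labels in random order, kept outside the data frame
labels <- sample(rep(0:1, each = n_obs / 2))

# One Welch t-test per column; vapply guarantees a numeric vector back
p_vals <- vapply(my_data,
                 function(x) t.test(x ~ labels)$p.value,
                 numeric(1))

# One row per variable, so the name travels with its p-value
p_table <- data.frame(variable = names(p_vals),
                      p.value  = unname(p_vals))
head(p_table)
```

vapply is used here instead of sapply only because it fails loudly if a test ever returns something other than a single number.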

You can also use colttests from the Bioconductor package genefilter:
library(genefilter)
colttests(as.matrix(my_data[, -ncol(my_data)]), factor(my_data$labels))

Related

Forecasting More than One Column

How do I run this for each of the 50 columns I have instead of one at a time?
#Chosen vector creation
IMBFM <- as.numeric(data$IMBFM)
#Hidden layers creation
alpha <- 1.5^(-10)
hn <- length(IMBFM)/(alpha*(length(IMBFM)+30))
#Fitting nnetar
lambda <- BoxCox.lambda(IMBFM)
dnn_pred <- nnetar(IMBFM, size= hn, lambda = lambda)
#Forecasting with the fitted model
dnn_forecast <- forecast(dnn_pred, h= 30, PI = TRUE)
dnn_forecast
plot(dnn_forecast)
Create a function that takes your column and returns a list containing the forecast and the plot:
f <- function(x) {
x <- as.numeric(x)
alpha <- 1.5^(-10)
hn <- length(x)/(alpha*(length(x)+30))
lambda <- BoxCox.lambda(x)
dnn_pred <- nnetar(x, size= hn, lambda = lambda)
dnn_forecast <- forecast(dnn_pred, h= 30, PI = TRUE)
return(
list("forecast" = dnn_forecast, "plot" = plot(dnn_forecast))
)
}
Create a vector of your columns of interest. There are many ways to do this; this is the manual way, but you might be able to use a regex on colnames(data) to select them, depending on the names:
mycols = c("IMBFM", "col2", "col3", ... "col50")
Use lapply to apply the function f to each of the columns named in mycols:
result = lapply(mycols, function(col) f(data[[col]]))
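The column-selection and lapply pattern can be tried without the forecast package. This is only a sketch under stated assumptions: the toy data frame, its column names, and the stand-in function f_demo are hypothetical, and f_demo merely summarises a column where the real f would fit nnetar and forecast.

```r
set.seed(1)

# Toy data standing in for the asker's `data`; column names are hypothetical
data <- data.frame(id   = 1:20,
                   col1 = rnorm(20),
                   col2 = rnorm(20),
                   col3 = rnorm(20))

# Stand-in for f(): takes one column and returns a list of results
f_demo <- function(x) {
  x <- as.numeric(x)
  list(n = length(x), mean = mean(x))
}

# Select the columns of interest with a regex on the column names
mycols <- grep("^col[0-9]+$", colnames(data), value = TRUE)

# Naming the input vector makes lapply return a named list,
# so each result can be looked up by its column name
result <- lapply(setNames(mycols, mycols),
                 function(col) f_demo(data[[col]]))
names(result)
```

With a named list, result$col2 (or result[["col2"]]) retrieves the output for that column directly.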

How to make outputs of models which are not in broom tidy in R

I have been trying to tidy the output of a wfe model so I can easily use it with ggplot and similar tools. I have had the same problem with other packages and statistical models that broom does not cover.
Let's say I create a dataset like this (taken from wfe's help file):
library(wfe)
## generate panel data with number of units = N, number of time = Time
N <- 10 # number of distinct units
Time <- 15 # number of distinct time
## treatment effect
beta <- 1
## generate treatment variable
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
## make sure at least one observation is treated for each unit
while ((sum(apply(treat, 2, mean) == 0) > 0) | (sum(apply(treat, 2, mean) == 1) > 0) |
(sum(apply(treat, 1, mean) == 0) > 0) | (sum(apply(treat, 1, mean) == 1) > 0)) {
treat <- matrix(rbinom(N*Time, size = 1, 0.25), ncol = N)
}
treat.vec <- c(treat)
## unit fixed effects
alphai <- rnorm(N, mean = apply(treat, 2, mean))
## generate two random covariates
x1 <- matrix(rnorm(N*Time, 0.5,1), ncol=N)
x2 <- matrix(rbeta(N*Time, 5,1), ncol=N)
x1.vec <- c(x1)
x2.vec <- c(x2)
## generate outcome variable
y <- matrix(NA, ncol = N, nrow = Time)
for (i in 1:N) {
y[, i] <- alphai[i] + treat[, i] + x1[,i] + x2[,i] + rnorm(Time)
}
y.vec <- c(y)
## generate unit and time index
unit.index <- rep(1:N, each = Time)
time.index <- rep(1:Time, N)
Data.obs <- as.data.frame(cbind(y.vec, treat.vec, unit.index, time.index, x1.vec, x2.vec))
colnames(Data.obs) <- c("y", "tr", "unit", "time", "x1", "x2")
Now I run a model from the function wfe (again, code from the package's help file):
mod.did <- wfe(y~ tr+x1+x2, data = Data.obs, treat = "tr",
unit.index = "unit", time.index = "time", method = "unit",
qoi = "ate", estimator ="did", hetero.se=TRUE, auto.se=TRUE,
White = TRUE, White.alpha = 0.05, verbose = TRUE)
## summarize the results
summary(mod.did)
My question is how to turn this output into a tidy object I could plot.
If I call tidy(mod.did) I get the following error:
Error: No tidy method for objects of class wfedid
I understand the error, but I am unsure how to solve it. I tried mapping the individual parameters (coefficient, se, etc.) into a new list object, but that did not work, so I hope that someone here knows of a more systematic way of doing this.
In case it helps, here's a dput of the output: https://pastebin.com/HTkKEUUQ
Thanks!
Here's a start at a tidy method:
library(dplyr); library(tibble)
tidy.wfedid <- function(x, conf.int=FALSE, conf.level=0.95, ...) {
cc <- (coef(summary(x))
%>% as.data.frame()
%>% setNames(c("estimate","std.error","statistic","p.value"))
%>% tibble::rownames_to_column("term")
%>% as_tibble()
)
return(cc)
}
Note that (1) I haven't implemented the confidence intervals (you could do this by using mutate to add the columns conf.low and conf.high, computed as estimate ± std.error*qnorm((1+conf.level)/2)); (2) this is the standard "tidy" method, which returns a coefficient table. If you want predictions and confidence intervals on predictions, you will need to write an augment method ...
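The confidence-interval step can be sketched on its own, independent of wfe. This is an illustration under assumptions: the helper name add_conf_int is hypothetical, the toy table stands in for the output of the tidy method (its numbers are made up), and the limits use the normal approximation from the formula in the note.

```r
# Hypothetical helper: add normal-approximation confidence limits
# to a tidy coefficient table with `estimate` and `std.error` columns
add_conf_int <- function(tab, conf.level = 0.95) {
  z <- qnorm((1 + conf.level) / 2)          # about 1.96 for a 95% interval
  tab$conf.low  <- tab$estimate - z * tab$std.error
  tab$conf.high <- tab$estimate + z * tab$std.error
  tab
}

# Toy table standing in for tidy(mod.did); values are made up
toy <- data.frame(term      = c("tr", "x1"),
                  estimate  = c(1.0, 0.5),
                  std.error = c(0.2, 0.1))

ci <- add_conf_int(toy)
ci
```

Inside tidy.wfedid, the same two assignments could be applied to cc when conf.int is TRUE.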

Applying lm() using sapply or lapply

So I'm trying to use lm() with sapply.
#example data and labels
data <- matrix(data = runif(1000), nrow = 100, ncol = 10)
markers <- sample(0:1, replace = T, size = 100)
# try to get linear model stuff
Lin <- sapply(data, function(x) lm(unlist(markers) ~ unlist(x))$coefficients)
My problem is that this gives me coefficients for 1000 equations rather than 10.
You need to supply sapply with a data frame, not a matrix.
#example data and labels
data <- data.frame(matrix(data = runif(1000), nrow = 100, ncol = 10))
markers <- sample(0:1, replace = T, size = 100)
# try to get linear model stuff
sapply(data, function(x) coef(lm(markers ~ x)))
sapply(data, function(x) coef(lm(markers ~ x))[-1]) # Omit intercepts
X1.x X2.x X3.x X4.x X5.x
0.017043626 0.518378546 -0.011110972 -0.145848478 0.335232991
X6.x X7.x X8.x X9.x X10.x
0.015122184 0.001985933 0.191279594 -0.077689961 -0.107411203
Your original matrix fails:
data <- matrix(data = runif(1000), nrow = 100, ncol = 10)
sapply(data, function(x) coef(lm(markers ~ x)))
# Error: variable lengths differ (found for 'x')
It fails because sapply, which calls lapply, converts its first argument, X, to a list using as.list before applying the function. as.list applied to a matrix returns a list with one element per entry of the matrix, in your case 1,000. as.list applied to a data frame returns a list with one element per column, in your case 10, each element containing the values of that column.
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x000002397f5ce508>
<environment: namespace:base>
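The difference between the two conversions is easy to check directly. The sketch below uses made-up object names (m, slopes); the apply() call over columns is an alternative for staying with a matrix and is not part of the answer above.

```r
set.seed(1)
m <- matrix(runif(1000), nrow = 100, ncol = 10)
markers <- sample(0:1, size = 100, replace = TRUE)

# as.list() on a matrix gives one element per entry ...
length(as.list(m))                    # 1000

# ... but one element per column on a data frame
length(as.list(as.data.frame(m)))     # 10

# If the data must stay a matrix, apply() over margin 2 visits columns
slopes <- apply(m, 2, function(x) coef(lm(markers ~ x))[2])
length(slopes)                        # 10
```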

How to extract the p.value and estimate from cor.test() in a data.frame?

In this example, I have temperature values from 50 different sites, and I would like to correlate Site 1 with each of the 50 sites. I only want to extract the components "p.value" and "estimate" returned by cor.test() and store them in two separate columns of a data.frame.
My attempt works, but I don't know why!
I would like to simplify my code, because at the moment I have to run the for loop twice to get my results.
Here is my example:
# Temperature data
data <- matrix(rnorm(500, 10:30, sd=5), nrow = 100, ncol = 50, byrow = TRUE,
dimnames = list(c(paste("Year", 1:100)),
c(paste("Site", 1:50))) )
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
# Extraction
for (i in 1:50) {
df1 <- cor.test(data[,1], data[,i] )
df[,2:3] <- df1[c("estimate", "p.value")]
}
for (i in 1:50) {
df1 <- cor.test(data[,1], data[,i] )
df[i,2:3] <- df1[c("estimate", "p.value")]
}
df
I would very much appreciate your help :)
I might offer up the following as well (masking the loops):
result <- do.call(rbind,lapply(2:50, function(x) {
cor.result<-cor.test(data[,1],data[,x])
pvalue <- cor.result$p.value
estimate <- cor.result$estimate
return(data.frame(pvalue = pvalue, estimate = estimate))
})
)
First of all, I'm guessing you had a typo in your code: you should have rnorm(5000, ...) if you want unique values. Otherwise the same 500 numbers are recycled 10 times to fill the 100 x 50 matrix.
Anyway, a simple way of doing this would be:
data <- matrix(rnorm(5000, 10:30, sd=5), nrow = 100, ncol = 50, byrow = TRUE,
dimnames = list(c(paste("Year", 1:100)),
c(paste("Site", 1:50))) )
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
estimates = numeric(50)
pvalues = numeric(50)
for (i in 1:50){
test <- cor.test(data[,1], data[,i])
estimates[i] = test$estimate
pvalues[i] = test$p.value
}
df$Estimate <- estimates
df$P.value <- pvalues
df
Edit: I believe your issue is in the line df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value=""). Because the columns are created from "", they are not numeric: with R's old default (stringsAsFactors = TRUE, before R 4.0.0) typeof(df$Estimate) reports integer because the column is a factor, while typeof(test$estimate) is a double, so R doesn't know what you're trying to do with those two values. You can redo your code like this:
df <- data.frame(label=paste("Site", 1:50), Estimate=numeric(50), P.value=numeric(50))
for (i in 1:50){
test <- cor.test(data[,1], data[,i])
df$Estimate[i] = test$estimate
df$P.value[i] = test$p.value
}
to make it a little more concise.
Similar to the answer of colemand77:
Create a cor function:
cor_fun <- function(x, y, method){
tmp <- cor.test(x, y, method= method)
cbind(r=tmp$estimate, p=tmp$p.value) }
Apply it over the columns of data. Transposing the result gives one row per site, with r in the first column and p in the second:
t(apply(data, 2, cor_fun, data[, 1], "spearman"))
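A variant of the same idea keeps the site labels attached to the results. This is a sketch rather than any of the answerers' code: it assumes the corrected rnorm(5000, ...) data, uses vapply over the column names, and the object name res is made up.

```r
set.seed(1)
data <- matrix(rnorm(5000, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(paste("Year", 1:100), paste("Site", 1:50)))

# One cor.test per site; each call returns c(estimate, p.value)
res <- t(vapply(colnames(data), function(site) {
  ct <- cor.test(data[, 1], data[, site])
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
}, numeric(2)))

# One row per site, with numeric columns from the start
df <- data.frame(label = rownames(res), res, row.names = NULL)
head(df)
```

Because vapply iterates over the column names, the row names of res carry the site labels, so no pre-allocated data frame is needed.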

p-value matrix of x and y variables from anova output

I have many X and Y variables (something like 500 x 500). The following is just a small example:
yvars <- data.frame (Yv1 = rnorm(100, 5, 3), Y2 = rnorm (100, 6, 4),
Yv3 = rnorm (100, 14, 3))
xvars <- data.frame (Xv1 = sample (c(1,0, -1), 100, replace = T),
X2 = sample (c(1,0, -1), 100, replace = T),
Xv3 = sample (c(1,0, -1), 100, replace = T),
D = sample (c(1,0, -1), 100, replace = T))
I want to extract p-values and make a matrix like this:
Yv1 Y2 Yv3
Xv1
X2
Xv3
D
Here is my attempt to loop the process:
prob = NULL
anova.pmat <- function (x) {
mydata <- data.frame(yvar = yvars[, x], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
}
sapply (yvars,anova.pmat)
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution ?
Edit:
For the first Y variable:
prob <- NULL
mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
prob
[1] 0.4995179 0.4067040 0.4181571 0.6291167
Edit again:
for (j in seq(length (yvars))){
prob <- NULL
mydata <- data.frame(yvar = yvars[, j], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
}
Gives the same result as above !!!
Here is an approach that uses plyr to loop over the columns of each data frame (treating it as a list): for every pair of an xvars column and a yvars column it returns the corresponding p-value, and the results are arranged into a matrix. Adding the row/column names is just extra.
library("plyr")
probs <- laply(xvars, function(x) {
laply(yvars, function(y) {
anova(lm(y~x))$`Pr(>F)`[1]
})
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)
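The same matrix can be built in base R with two nested sapply calls, which also carries the names over automatically. This is a sketch offered as an alternative, not the answerer's method; the seed is arbitrary and the data follow the question's example.

```r
set.seed(42)
yvars <- data.frame(Yv1 = rnorm(100, 5, 3), Y2 = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = TRUE),
                    X2  = sample(c(1, 0, -1), 100, replace = TRUE),
                    Xv3 = sample(c(1, 0, -1), 100, replace = TRUE),
                    D   = sample(c(1, 0, -1), 100, replace = TRUE))

# Outer loop over Y columns, inner loop over X columns;
# sapply simplifies the result to a matrix with X in rows and Y in columns
probs <- sapply(yvars, function(y) {
  sapply(xvars, function(x) anova(lm(y ~ x))$`Pr(>F)`[1])
})
probs
```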
Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:
dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1,
function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, ncol = ncol(yvars), dimnames = list(levels(dfrm$x), levels(dfrm$y)))
Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlations for all pairwise comparisons and transform the r statistic into a Fisher-Snedecor F using the relation $F = \nu_a r^2 / (1 - r^2)$, where the degrees of freedom are defined as $\nu_a = (n - 2) - \#\{(x_i = \mathrm{NA}), (y_i = \mathrm{NA})\}$ (that is, $(n - 2)$ minus the number of pairwise missing values; if there are no missing values, this formula is the usual coefficient $R^2$ in regression).
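The correlation shortcut can be sketched as follows, assuming no missing values (so the degrees of freedom are n - 2) and numeric X variables that enter each model as a single term, as in the question's example. The object names r, Fstat and pmat are made up, and the last lines compare one entry against the lm-based value.

```r
set.seed(42)
yvars <- data.frame(Yv1 = rnorm(100, 5, 3), Y2 = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = TRUE),
                    X2  = sample(c(1, 0, -1), 100, replace = TRUE),
                    Xv3 = sample(c(1, 0, -1), 100, replace = TRUE),
                    D   = sample(c(1, 0, -1), 100, replace = TRUE))

n <- nrow(yvars)

# All pairwise Pearson correlations in one call (X in rows, Y in columns)
r <- cor(xvars, yvars)

# F = nu * r^2 / (1 - r^2) with nu = n - 2 (no missing values assumed)
Fstat <- (n - 2) * r^2 / (1 - r^2)

# Upper-tail p-values on 1 and n - 2 degrees of freedom;
# assigning into a copy of r keeps the dimensions and names
pmat <- r
pmat[] <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

# Cross-check one cell against the regression-based p-value
ref <- anova(lm(yvars$Yv1 ~ xvars$Xv1))$`Pr(>F)`[1]
all.equal(unname(pmat["Xv1", "Yv1"]), ref)
```

With missing values, n - 2 would have to be replaced by the pairwise degrees of freedom described in the sidenote.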

Resources