I want to perform multiple t-tests on data in the following format: the first column is "id", with values (for example) 1, 1, 1, 2, 2, 2; the second column is "ratios", with values 0.2, 0.18, 0.3, 1.5, 1.4, 1.6. For each instance of "id" I want to test all of its ratio values against all the ratio values in the dataframe.
Right now I have this:
data <- read.delim("clipboard", stringsAsFactors=FALSE) ##data to test
dist <- as.numeric(readClipboard()) ##distribution to test against
data$Ratio.Mean.H.L <- NA
data$p.value <- NA
for (i in 1:nrow(data))
  if (nrow(data) > 1)
  {
    # Welch t-test
    t.test.result <- t.test(data$ratio[i], dist,
                            alternative = "two.sided",
                            mu = 0,
                            paired = FALSE,
                            var.equal = FALSE,
                            conf.level = 0.95)
    # writes the result into the data.frame
    data$p.value[i] <- t.test.result$p.value
  }
write.table(data, file="C:/R_Temp/t-test.txt", sep = "\t")
I know this does not work; for one, I am not sure I am only testing rows that share the same "id". I am also manually entering the distribution to test against, which is all entries in the "ratio" column.
How do I do this correctly, and how do I add a multiple-testing correction (Bonferroni)?
I suspect that MattParker's comment is going to be the biggest thing here: you are comparing a single number with a vector, and t.test will complain about that. Since you want to perform tests per grouping variable (id), in base R you probably want a function like by (or split). (There are great methods within dplyr and data.table as well.)
Using mtcars as sample data, I'll try to mimic your data:
dat <- mtcars[c("cyl", "mpg")]
colnames(dat) <- c("id", "ratio")
It isn't clear what you mean to use for dist, so I'll use the naïve
dist <- 1:10
Now you can do:
by(dat$ratio, dat$id, function(x) t.test(x, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
If you want/need to deal with more than just ratio at a time, you can alternatively do this:
by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# dat$id: 4
# [1] 2.660716e-10
# ------------------------------------------------------------
# dat$id: 6
# [1] 4.826322e-09
# ------------------------------------------------------------
# dat$id: 8
# [1] 2.367184e-07
The result from the call to by is of class "by", which is really just a repackaged list with some extra attributes:
res <- by(dat, dat$id, function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
class(res)
# [1] "by"
str(attributes(res))
# List of 4
# $ dim : int 3
# $ dimnames:List of 1
# ..$ dat$id: chr [1:3] "4" "6" "8"
# $ call : language by.data.frame(data = dat, INDICES = dat$id, FUN = function(x) t.test(x$ratio, dist, paired = FALSE)$p.value)
# $ class : chr "by"
So you can expand/access it however you would a list:
res[[1]]
# [1] 2.660716e-10
as.numeric(res)
# [1] 2.660716e-10 4.826322e-09 2.367184e-07
names(res)
# [1] "4" "6" "8"
(Realize that the different levels of dat$id are the integers 4, 6, and 8, so the names should correspond to your $id.)
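(As an aside, since dplyr was mentioned above: a rough equivalent there, sketched under the assumption that the package is installed, is a grouped summarise.)
library(dplyr)
dat %>%
  group_by(id) %>%
  summarise(p.value = t.test(ratio, dist, paired = FALSE)$p.value)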
Edit:
If you want the results in a data.frame, two options come to mind:
Repeat the p-value for each and every row, resulting in a lot of duplication. I discourage this method for several reasons; if you need it at some point, I suggest using option 2 and then merge.
Produce a data.frame with as many rows as unique id. Something like:
do.call(rbind.data.frame,
by(dat, dat$id, function(x) list(id=x$id[1], pv=t.test(x, dist, paired=F)$p.value)))
# id pv
# 4 4 1.319941e-03
# 6 6 2.877065e-03
# 8 8 6.670216e-05
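To address the Bonferroni part of the question: a minimal sketch (not part of the original answer), building on the res and dat objects above, is to run p.adjust over the per-id p-values and, if per-row values are needed (option 1 above), merge them back onto the data.
# Bonferroni-adjust the per-id p-values from res
pv <- data.frame(id = as.numeric(names(res)),
                 p.value = as.numeric(res),
                 adj.p.value = p.adjust(as.numeric(res), method = "bonferroni"))
# option 1: repeat the (adjusted) p-value on every row via merge
dat.annotated <- merge(dat, pv, by = "id", all.x = TRUE)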
OK, sorry for the poorly defined question. I got help elsewhere and will post the script that worked for those who are interested. I want to calculate p-values for ratio changes in a proteomics experiment. To do this I make individual t-tests for all the ratio measurements for any given protein or PTM site. These measurements are compared to the median of all measurements (mu in the t.test function), or to the entire distribution of measurements. In one column I have "id"s, which are unique for each entry; in the other column I have "values" (ratios). I make a t-test comparing all "values" that occur with any given unique "id". For ease of use I paste the table into the script rather than calling it from a file (it saves me a step).
data <- read.delim("clipboard", stringsAsFactors=FALSE) ## data to test (two columns: "id" and "value"); log-transform the ratios first!
summary(data)
med <- median(data$value)
# function for the id-grouped t-test
calc_id_ttest <- function(d) # col1: id, col2: values
{
  colnames(d) <- c("id", "value") # reassign the column names
  # calculate the number of values for each id
  res_N <- as.data.frame(tapply(d$value, d$id, length))
  colnames(res_N) <- "N"
  res_N$id <- row.names(res_N)
  # calculate the median of the values for each id
  res_med <- as.data.frame(tapply(d$value, d$id, median))
  colnames(res_med) <- "med"
  res_med$id <- row.names(res_med)
  # calculate the p-values
  res_pval <- as.data.frame(tapply(d$value, d$id, function(x)
  {
    if (length(x) < 3)
    { # the t-test requires at least 3 samples
      NA
    }
    else
    {
      # one-sample t-test against the overall median (mu = med);
      # alternatively compare x with the entire distribution (d$value),
      # or set alternative = "less"/"greater" for a one-sided test
      t.test(x, mu = med)$p.value
    }
  }))
  colnames(res_pval) <- "pval" # nominal p-value
  res_pval$id <- row.names(res_pval)
  res_pval$adj.pval <- p.adjust(res_pval$pval, method = "BH") # multiple testing correction; also "bonferroni"
  res <- Reduce(function(x, y)
  {
    merge(x, y, by = "id", all = TRUE)
  },
  list(res_N, res_med, res_pval))
  return(res)
}
data_result <- calc_id_ttest(d = data)
write.table(data_result, file="C:/R_Temp/t-test.txt", quote = FALSE, row.names = FALSE, col.names = TRUE, sep = "\t")
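As noted in the comments inside the function, you could instead compare each id's values against the entire distribution of values (a two-sample Welch test) rather than against the overall median. A sketch of that variant, reusing the data object above:
# two-sample Welch test of each id's values against the whole distribution
res_pval2 <- as.data.frame(tapply(data$value, data$id, function(x)
  if (length(x) < 3) NA else t.test(x, data$value)$p.value))
colnames(res_pval2) <- "pval"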
Related
I have the dataset below, where I am trying to do a rolling 3-day correlation across x, y, z and a. So the code should do rolling correlations of xy, xz, xa, yx, yz, ya and so on. Also, as you can see below, the data for y and a is incomplete, but I would like to do rolling correlations for them starting from the date where they first have values (i.e. id 3 and id 4).
How should I accomplish this? I don't know where to start...
set.seed(42)
n <- 10
dat <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-22"), as.Date("2020-12-31"), "day"),
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
a=rnorm(n))
dat$y[1:2] <- NA
dat$a[1:3] <- NA
I was able to find this code on Stack Overflow, but it only gives the answer for the first column and not all the columns:
rollapplyr(x, 5, function(x) cor(x[, 1], x[, -1]), by.column = FALSE)
Create a data frame with only the columns wanted and then use rollapplyr with cor. cor takes a use= argument that specifies how missing values are to be handled. See ?cor for the values it can take since you may or may not wish to use the value we used below.
The result r is a matrix whose i-th row describes the correlation matrix of the 5 dat2 rows ending in and including row i. That is, matrix(r[i, ], 4, 4) is the correlation matrix of dat2[i-(4:0), ].
We can also create ar which is a 3d array which is such that ar[i,,] is the correlation matrix of the 5 rows of dat2 ending in and including row i.
That is, these are equal for each i in 5, ..., nrow(dat2). (The first 4 rows of r are all NA since there are not yet 5 rows ending at those positions.)
1. cor(dat2[i-(4:0), ], use = "pairwise")
2. matrix(r[i, ], 4, 4)
3. ar[i,,]
We run checks for these equivalences for i=5 below.
library(zoo)
w <- 5
dat2 <- dat[c("x", "y", "z", "a")]
nr <- nrow(dat2)
nc <- ncol(dat2)
r <- rollapplyr(dat2, w, cor, use = "pairwise", by.column = FALSE, fill = NA)
colnames(r) <- paste(names(dat2)[c(row(diag(nc)))],
names(dat2)[c(col(diag(nc)))], sep = ".")
ar <- array(r, c(nr, nc, nc),
dimnames = list(NULL, names(dat2), names(dat2)))
# run some checks
cor5 <- cor(dat2[1:w, ], use = "pairwise") # cor of 1st w rows
# same except for names
all.equal(unname(cor5), matrix(r[w, ], nc))
## [1] TRUE
all.equal(cor5, ar[w,,])
## [1] TRUE
The above shows a matrix whose rows are strung out correlation matrices and a 3d array whose slices are correlation matrices. Another possibility for output is to create a list of correlation matrices.
lapply(1:nr, function(i) {
  if (i >= w) cor(dat2[i-((w-1):0), ], use = "pairwise")
})
combn produces all the combinations.
cols <- c("x", "y", "z", "a")
combn(cols, 2)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "x" "x" "x" "y" "y" "z"
# [2,] "y" "z" "a" "z" "a" "a"
combn also has a function argument: there you can first na.omit all rows with NAs, then subset with mapply over incrementing index windows of width w (here 1:3 shifted by 0, 1, 2, ...) and calculate correlations until nrow is reached.
w <- 3 ## size of the rolling window
combn(dat[cols], 2, function(x) {
  X <- na.omit(x)
  n <- nrow(X)
  mapply(function(y, z) cor(X[y + z, 1], X[y + z, 2]), list(1:w), 0:(n - w))
}, simplify = FALSE)
# [[1]]
# [1] 0.5307784 -0.9874843 -0.8364802 0.2407730 0.3655328 -0.4458231
#
# [[2]]
# [1] 0.8121466 0.9652715 0.3304100 0.8278965 -0.1425097 0.5832558 0.9959705
# [8] 0.8696023
#
# [[3]]
# [1] 0.6733985 0.2194488 0.5593983 -0.6589249 -0.9291184
#
# [[4]]
# [1] 0.97528684 -0.90599558 -0.42319742 0.92882443 0.28058418 0.05427966
#
# [[5]]
# [1] -0.7815678 -0.7182037 -0.6698260 0.4592962 0.7452225
#
# [[6]]
# [1] 0.9721521 0.9343926 -0.3470329 -0.7237291 -0.6253825
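If it helps to keep track of which column pair each list element corresponds to, one option (a sketch, not part of the original answer) is to name the results with the column pairs, which come out of combn in the same order:
pair_cor <- function(x) {
  X <- na.omit(x)
  n <- nrow(X)
  mapply(function(y, z) cor(X[y + z, 1], X[y + z, 2]), list(1:w), 0:(n - w))
}
res <- combn(dat[cols], 2, pair_cor, simplify = FALSE)
names(res) <- combn(cols, 2, paste, collapse = ".")
names(res)
# [1] "x.y" "x.z" "x.a" "y.z" "y.a" "z.a"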
Using the R package "effsize", I am trying to calculate Cohen's d between all pairs of groups in my data, outputting all the pairwise d estimates as a matrix. I have provided some test data to illustrate this. I would want a matrix of d estimates for all pairs of groups 1, 2, and 3.
I am struggling to find where to start with this. I know that it could be done using loops but since my real data contains 1000 groups each with 6000 data points I think this would be slow.
library("effsize")
test <- data.frame(
score=c(2,3,42,1,2,3,4,5,5,6,8,2),
group=c(1,1,1,1,2,2,2,2,3,3,3,3)
)
This would be similar functionality to what is provided for the Wilcoxon rank-sum test by pairwise.wilcox.test().
All you have to do is note that the function combn outputs the combinations of n elements taken k at a time and can also apply a function to each resulting combination. In this case the question asks for combinations of 2 groups at a time, and the function fun is applied to each one.
fun <- function(x) {
cohen.d(x[[1]]$score, x[[2]]$score)
}
sp <- split(test, test$group)
cmb <- combn(sp, 2, fun)
cmb[, 1]
#[[1]]
#[1] "Cohen's d"
#
#[[2]]
#[1] "d"
#
#[[3]]
#[1] 0.5992954
#
#[[4]]
# lower upper
#-1.169345 2.367936
#
#[[5]]
#[1] 0.95
#
#[[6]]
#[1] medium
#Levels: negligible < small < medium < large
The code above can be written as a function that does all the work and returns a matrix.
cohen.d.pairwise.test <- function(DF, scoreCol, groupCol){
  fun <- function(x) {
    eff <- cohen.d(x[[1]][[scoreCol]], x[[2]][[scoreCol]])
    c(eff[["estimate"]],
      eff[["conf.int"]][1],
      eff[["conf.int"]][2],
      eff[["conf.level"]])
  }
  sp <- split(DF, DF[[groupCol]])
  cmb <- combn(sp, 2, fun)
  rownames(cmb) <- c("estimate", "lower", "upper", "conf.level")
  t(cmb)
}
cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
# estimate lower upper conf.level
#[1,] 0.5992954 -1.169345 2.3679357 0.95
#[2,] 0.4732232 -1.281054 2.2275008 0.95
#[3,] -0.8795932 -2.691556 0.9323698 0.95
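If you also want to see which pair of groups each row refers to, a small extension (a sketch, not part of the original answer) is to label the rows with the group combinations, which come out of combn in the same order:
out <- cohen.d.pairwise.test(test, scoreCol = "score", groupCol = "group")
rownames(out) <- combn(names(split(test, test$group)), 2, paste, collapse = " vs ")
rownames(out)
# [1] "1 vs 2" "1 vs 3" "2 vs 3"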
Let's assume four data frames, each with 3 vectors, e.g.
setA <- data.frame(
a1 = c(6,5,2,4,5,3,4,4,5,3),
a2 = c(4,3,1,4,5,1,1,6,3,2),
a3 = c(5,4,5,6,4,6,5,5,3,3)
)
setB <- data.frame(
b1 = c(5,3,4,3,3,6,4,4,3,5),
b2 = c(4,3,1,3,5,2,5,2,5,6),
b3 = c(6,5,4,3,2,6,4,3,4,6)
)
setC <- data.frame(
c1 = c(4,4,5,5,6,4,2,2,4,6),
c2 = c(3,3,4,4,2,1,2,3,5,4),
c3 = c(4,5,4,3,5,5,3,5,5,6)
)
setD <- data.frame(
d1 = c(5,5,4,4,3,5,3,5,5,4),
d2 = c(4,4,3,3,4,3,4,3,4,5),
d3 = c(6,5,5,3,3,4,2,5,5,4)
)
I'm trying to find n number of vectors in each data frame that have the highest correlation among each other. For this simple example, let's say we want to find the n = 1 vectors in each of the k = 4 data frames that show the overall strongest positive correlation cor().
I'm not interested in the correlation of vectors within a data frame, but in the correlation between data frames, since I wish to pick 1 variable from each set.
Intuitively, I would sum all the correlation coefficients for each combination, i.e.:
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...
...but this seems like brute-forcing a solution that might be solvable more elegantly, with some kind of clustering-technique?
Anyhow, I was hoping to find a dynamic solution like function(n = 1, ...) where (... for data frames) which would return a list of the highest correlating vector names.
Based on your example, I would not go with a really complicated algorithm unless your actual data is huge. This is a simple approach that I think gets what you want.
So, based on your 4 data frames, I create list_df and then, in the function, I generate all the possible combinations of variables and calculate their correlations. At the end I select the n combinations with the highest correlation.
list_df = list(setA,setB,setC,setD)
CombMaxCor = function(n = 1, list_df){
  column_names = lapply(list_df, colnames)
  mat_comb = expand.grid(column_names)
  mat_total = do.call(cbind, list_df)
  vec_cor = rep(NA, nrow(mat_comb))
  for(i in 1:nrow(mat_comb)){
    vec_cor[i] = sum(cor(mat_total[, as.character(unlist(mat_comb[i, ]))]))
  }
  pos_max_temp = rev(sort(vec_cor))[1:n]
  pos_max = vec_cor %in% pos_max_temp
  comb_max_cor = mat_comb[pos_max, ]
  return(comb_max_cor)
}
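A quick usage sketch with the sets from the question (no output shown; the function returns the chosen column names, one per data frame):
# the single best combination of one column per data frame
CombMaxCor(n = 1, list_df)
# or the top three combinations
CombMaxCor(n = 3, list_df)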
You could use the combn function. (Here a is assumed to be the column-bound combination of the four sets, which the original snippet left undefined.)
a <- cbind(setA, setB, setC, setD)  # assumption: 'a' holds all twelve columns a1...d3
fun = function(x){
  nm = paste0(names(x), collapse = "")
  if(!grepl("(.)\\d.*\\1", nm, perl = T))
    setNames(sum(cor(x)), nm)
}
unlist(combn(a, 4, fun, simplify = FALSE))[1:3] # Only printed the first 3
a1b1c1d1 a1b1c1d2 a1b1c1d3
3.246442 4.097532 3.566949
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
[1] 3.246442
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
[1] 4.097532
sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
[1] 3.566949
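To then pull out the single combination with the highest total correlation from that named vector, a sketch along these lines should work (using a and fun as defined above):
res <- unlist(combn(a, 4, fun, simplify = FALSE))
res[which.max(res)]  # the name gives the chosen columns, the value the summed correlation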
Here is a function we can use to get n non-repeating columns from each data frame to get the max total correlation:
func <- function(n, ...){
  list.df <- list(...)
  n.df <- length(list.df)
  # 1) First get the correlations
  get.two.df.cors <- function(df1, df2) apply(df1, 2,
    function(x) apply(df2, 2, function(y) cor(x, y))
  )
  cor.combns <- lapply(list.df, function(x)
    lapply(list.df, function(y) get.two.df.cors(x, y))
  )
  # 2) Define a function to help with aggregating the correlations.
  #    We will call it for different combinations of selected columns from each df later.
  #    cmbns: a matrix whose i-th row gives the columns to be selected from the i-th df;
  #    returns the "total correlation"
  get.cmbn.sum <- function(cmbns, cor.combns){
    # a helper matrix to help aggregation:
    # each row represents which two data frames we want the correlation sums for
    df.df <- t(combn(seq(n.df), 2, c))
    # convert to a list of selections for each df
    cmbns <- split(cmbns, seq(nrow(cmbns)))
    sums <- apply(df.df, 1,
      function(dfs) sum(
        cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]]
      )
    )
    # the sum of the sums gives the "total correlation"
    sum(sums)
  }
  # 3) Now perform the aggregation
  # get the ways of choosing n columns from each of the k data frames
  if (n==1) {
    cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
  } else {
    cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
  }
  # get all unique selection methods
  unique.selections <- Reduce(function(all.dfs, new.df){
    all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
    all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
    for(i in seq(nrow(new.df))){
      for(j in seq(length(all.dfs.lst[[i]]))){
        all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
      }
    }
    do.call(c, all.dfs.lst)
  }, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
  # for each unique selection method, calculate the total correlation
  result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
  return( unique.selections[[which.max(result)]] )
}
And now we have:
# n = 1
func(1, setA, setB, setC, setD)
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 2
# n = 2
func(2, setA, setB, setC, setD)
# [,1] [,2]
# [1,] 1 2
# [2,] 2 3
# [3,] 2 3
# [4,] 2 3
I have a database with columns theme (value 0 or 1), level (value 1 to 9) and startTime (double value). For every level, I want to perform a t-test on the startTime values. Here is my code:
database <- read.csv("database.csv")
themeData <- database[database$theme == 1, ]
noThemeData <- database[database$theme == 0, ]
for (i in 1:9) {
x <- themeData[themeData$level == i, ]
y <- noThemeData[noThemeData$level == i, ]
t.test(x$startTime,y$startTime,
alternative = "less")
}
Unfortunately, no t-tests are being executed. In the end, x and y simply get the value for i=9. What am I doing wrong?
Your code is doing busy work: it performs the t.test calculations, but since for loops always discard their implied results, you aren't storing them anywhere. You would have to use a vector or list (pre-allocated is always better), like so:
res <- replicate(9, NULL)
for (i in 1:9) {
  x <- themeData[themeData$level == i, ]
  y <- noThemeData[noThemeData$level == i, ]
  res[[i]] <- t.test(x$startTime, y$startTime,
                     alternative = "less")
}
res[[2]]
This can be "good enough" in that it is saving all test "results objects" in a list for later processing/consumption. A slightly better method is to use one of the *apply functions; the first two I think of that are directly applicable here (lapply, sapply(..., simplify=FALSE)) have various minor advantages, frankly you can choose either.
res <- lapply(c(4, 6, 8), function(thiscyl) {
  am0 <- subset(mtcars, am == 0 & cyl == thiscyl)
  am1 <- subset(mtcars, am == 1 & cyl == thiscyl)
  t.test(am0$mpg, am1$mpg)
})
This is especially beneficial if (unlike here) the tests take a long time: you perform the test once and preserve the model objects, so you can do lots of things with the results without having to rerun the tests. For instance, if you wanted just the p-values:
sapply(res, `[`, "p.value")
# $p.value
# [1] 0.01801712
# $p.value
# [1] 0.187123
# $p.value
# [1] 0.7038727
or more tersely:
sapply(res, `[[`, "p.value")
# [1] 0.01801712 0.18712303 0.70387268
Another example, the confidence intervals, in a matrix:
t(sapply(res, `[[`, "conf.int"))
# [,1] [,2]
# [1,] -9.232108 -1.117892
# [2,] -3.916068 1.032735
# [3,] -2.339549 1.639549
You can always look at a single model with, say, res[[2]], but if you need to see all of them you can use just res and see the whole gamut.
res[[2]]
# Welch Two Sample t-test
# data: am0$mpg and am1$mpg
# t = -1.5606, df = 4.4055, p-value = 0.1871
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -3.916068 1.032735
# sample estimates:
# mean of x mean of y
# 19.12500 20.56667
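Applied back to your own data, the same pattern would look something like this (a sketch using the themeData/noThemeData objects from your question):
res <- lapply(1:9, function(i) {
  x <- themeData[themeData$level == i, ]
  y <- noThemeData[noThemeData$level == i, ]
  t.test(x$startTime, y$startTime, alternative = "less")
})
sapply(res, `[[`, "p.value")  # one p-value per level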
I have a data frame with 1000 rows and I want to perform some operation on it 100 rows at a time.
So, I am trying to find out how I would use a counter to increment over the rows and select 100 rows at a time, like 1 to 100, then 101 to 200... up to 1000, and perform an operation on each subset using a for loop. Can anyone please suggest how this can be done? I could not find a good method.
An easy way would be to create a grouping variable, then use split() and lapply() to do whatever operations you need to.
Your grouping can be easily created using rep().
Here is an example:
set.seed(1)
demo = data.frame(A = sample(300, 50, replace=TRUE),
B = rnorm(50))
demo$groups = rep(1:5, each=10)
demo.split = split(demo, demo$groups)
lapply(demo.split, colMeans)
# $`1`
# A B groups
# 165.9000000 -0.1530186 1.0000000
#
# $`2`
# A B groups
# 168.2000000 0.1141589 2.0000000
#
# $`3`
# A B groups
# 126.0000000 0.1625241 3.0000000
#
# $`4`
# A B groups
# 159.4000000 0.3340555 4.0000000
#
# $`5`
# A B groups
# 181.8000000 0.0363812 5.0000000
If you prefer not to add the groups to your source data.frame, you can achieve the same effect by doing the following:
groups = rep(1:5, each=10)
lapply(split(demo, groups), colMeans)
Of course, replace colMeans with whatever function you want.
Using your example of a data.frame with 1000 rows, your rep() statement should be something like:
rep(1:10, each=100)
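Putting that together for your 1000-row case, a sketch would be (here df and your_function are placeholders for your actual data frame and operation):
groups <- rep(1:10, each = 100)                      # 10 blocks of 100 rows
results <- lapply(split(df, groups), your_function)  # df / your_function are placeholders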
The answer from @mrdwab is great and shows how to avoid a for loop. But if you really must use a for loop (the biglm package would be one example where you might want to), then here is one approach:
for( i in seq(1,1000,by=100) ) {
myfun( df[ i:(i+99), ] )
}
If the total number of rows is not a multiple of the block size then you might want something more like:
tmp <- seq( 1, nrow(df), by=100 )
tmp2 <- c( tail( tmp, -1)-1, nrow(df) )
n <- length(tmp)
out <- numeric(n)
for( i in seq_along(tmp) ) {
out[i] <- myfun( df[ tmp[i]:tmp2[i], ] )
}
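For a self-contained check of that pattern, one could plug in placeholder data and a toy myfun, e.g.:
# placeholders: a 1000-row data frame and a block-wise mean as myfun
df <- data.frame(x = rnorm(1000), y = rnorm(1000))
myfun <- function(chunk) mean(chunk$x)

tmp <- seq(1, nrow(df), by = 100)
tmp2 <- c(tail(tmp, -1) - 1, nrow(df))
out <- numeric(length(tmp))
for (i in seq_along(tmp)) out[i] <- myfun(df[tmp[i]:tmp2[i], ])
out  # ten block means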