How to perform wilcox test in R? - r

I have this data frame with 4 genes and 3 samples measured in duplicate.
The TS is the standard.
I want to perform the wilcox test between the sample S1 with TS and S2 with the TS for each protein, but i´m having problems with the for cycle.
MS.rawMV <- read.table("C:/Users/aaa/Desktop/genomic/MS.csv", header=T)
S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene 1 1 1 2 3 5 5
gene 2 10 10 4 5 9 10
gene 3 5 6 4 4 5 7
gene 4 9 9 8 7 6 6
Samples=list(
S1=grep("S1_*", colnames(MS.rawMV), value=TRUE),
S2=grep("S2_*", colnames(MS.rawMV), value=TRUE),
TS=grep("TS_*", colnames(MS.rawMV), value=TRUE))
sample.names <- names(Samples)
ref.sample <- "TS_"
# Build a data.frame
GRates <- data.frame(MS.rawMV[Reduce("c", Samples)])
## Statistics: non parametric test using TS as a standart
for (i in names(Samples)) {
WILCOXTEST <- wilcox.test(GRates[c(Samples[[i]])],Samples[[ref.sample]])
pnames <- paste(i,".wilcoxtest",sep="")
GRates[pnames] <- WILCOXTEST["p.value"]
}
Error in wilcox.test.default(GRates[Samples[[i]]], Samples[[ref.sample[i]]]) :
'x' must be numeric

It looks like the data is being treated as a factor.
The easiest fix would be to convert them back to numeric via factor->character->numeric.
try this
wilcox.test(
as.numeric(as.character(GRates[c(Samples[[i]])])),
as.numeric(as.character(Samples[[ref.sample]]))
)
If you try to convert straight to numeric from factor, you'll end up with integers that represent the factor classes instead of the actual values.

#DWin's comment is well taken (you have additional structure in your data that is hard to incorporate into a Wilcoxon test). However, if you want to ignore the distinction between the _1 and _2 columns and run Wilcoxon test on S1 vs TS and S2 vs TS, here's a way to rearrange the data and do it:
dat <- read.table(text="
gene S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
1 1 1 2 3 5 5
2 10 10 4 5 9 10
3 5 6 4 4 5 7
4 9 9 8 7 6 6",
header=TRUE)
library(reshape2)
library(plyr)
m1 <- melt(dat,id.var="gene")
## break var_num into separate components
m2 <- subset(data.frame(m1,
colsplit(m1$variable,"_",names=c("var","num"))),
select=-variable)
## combine treatments with standards
m3 <- merge(subset(m2,var!="TS"),
subset(m2,var=="TS"),by=c("gene","num"))
## clean up
m4 <- subset(rename(m3,c(value.x="value",var.x="var",value.y="standard")),
select=-var.y)
## apply Wilcoxon test to each component, save the p value
ddply(m4,"var",
function(x) with(x,wilcox.test(value,standard))$p.value)
Or, if you want to test each replication separately (as in #agstudy's answer), do
ddply(m4,c("var","num"),
function(x) with(x,wilcox.test(value,standard))$p.value)
instead.

I think , since wilcox.test is not vectorized you need 2 loops. Even I am not sure Of the statistical meaning of this , here how you can do :
nn <- colnames(dat)
lapply(1:2,function(x){
col.L <- grep(paste0('S',x,'_*'),nn)
col.R <- dat[,paste0('TS_',x)]
lapply(col.L,function(y)
wilcox.test(dat[,y],col.R)['p.value'])
})
Here I assume dat as
dat <- read.table(text='S1_1 S1_2 S2_1 S2_2 TS_1 TS_2
gene_1 1 1 2 3 5 5
gene_2 10 10 4 5 9 10
gene_3 5 6 4 4 5 7
gene_4 9 9 8 7 6 6',header=TRUE)

Related

Build a data frame with overlapping observations

Lets say I have a data frame with the following structure:
> DF <- data.frame(x=1:5, y=6:10)
> DF
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I need to build a new data frame with overlapping observations from the first data frame to be used as an input for building the A matrix for the Rglpk optimization library. I would use n-length observation windows, so that if n=2 the resulting data frame would join rows 1&2, 2&3, 3&4, and so on. The length of the resulting data frame would be
(numberOfObservations-windowSize+1)*windowSize
The result for this example with windowSize=2 would be a structure like
x y
1 1 6
2 2 7
3 2 7
4 3 8
5 3 8
6 4 9
7 4 9
8 5 10
I could do a loop like
DFResult <- NULL
numBlocks <- nrow(DF)-windowSize+1
for (i in 1:numBlocks) {
DFResult <- rbind(DFResult, DF[i:(i+horizon-1), ])
}
But this seems vey inefficient, especially for very large data frames.
I also tried
rollapply(data=DF, width=windowSize, FUN=function(x) x, by.column=FALSE, by=1)
x y
[1,] 1 6
[2,] 2 7
[3,] 2 7
[4,] 3 8
where I was trying to repeat a block of rows without applying any aggregate function. This does not work since I am missing some rows
I am a bit stumped by this and have looked around for similar problems but could not find any. Does anyone have any better ideas?
We could do a vectorized approach
i1 <- seq_len(nrow(DF))
res <- DF[c(rbind(i1[-length(i1)], i1[-1])),]
row.names(res) <- NULL
res
# x y
#1 1 6
#2 2 7
#3 2 7
#4 3 8
#5 3 8
#6 4 9
#7 4 9
#8 5 10

Deleting columns in a data frame using a list of variable names

I have an automated script that produces a standard formula (i.e., y~x1+x2) and I would like to screen my data out based on those variables.
So far I have gotten this far, but I hit a sticking point where I can't quite figure it out:
#Example data
df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
df
x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
#Example formula
ex_form = "x~y+u"
#Delete the ~ and add a + sign to be consistent
step1 = gsub("~","+", ex_form)
#Remove + signs
step2 = strsplit(step1, "\\+")
#Final list of variables
step3 = unlist(step2)
Most solutions I've seen is something along the lines of:
#Create list of variables
mylist = c("x", "y", "u")
#Cut data
temp = df[ ,mylist]
temp
x y u
1 1 2 4
2 2 3 5
3 3 4 6
4 4 5 7
5 5 6 8
But this solution doesn't quite fit into the automation...so I need to jump from what I have to that outcome. Any thoughts?
Note: Tags are my guesses.
If you don't put your formula between " " it will be recognized as such, and can use all.vars() to extract variables from it.
ex_form = x~y+u #Without quotes it is a formula, check str(ex_form)
df[, all.vars(ex_form)]
# x y u
#1 1 2 4
#2 2 3 5
#3 3 4 6
#4 4 5 7
#5 5 6 8
Am I missing something or does simply doing temp <- df[,step3] return exactly what you say you want?

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both coluns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straight forward approach to this? I haven't been able to think one one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to #Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is it can deal immediately with many > 2 grouping columns with the same code. (So long as you write the function that determines links flexibly.) A disadvantage is you need to make all pair-wise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, #Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are
compare rows based on groups and define these rows as linked (i.e., create a graph)
determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they in the same group, the matrix entry is 1, otherwise it's 0). First making a helper function that determines whether two rows are linked
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in #rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with #Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[i]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersetions
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

Permutation Testing

I have the following data set for which I've written some code to do permutation testing
df <- read.table(text="Group var1 var2 var3 var4 var5
1 3 5 7 3 7
1 3 7 5 9 6
1 5 2 6 7 6
1 9 5 7 0 8
1 2 4 5 7 8
1 2 3 1 6 4
2 4 2 7 6 5
2 0 8 3 7 5
2 1 2 3 5 9
2 1 5 3 8 0
2 2 6 9 0 7
2 3 6 7 8 8
2 10 6 3 8 0", header = TRUE)
This is my code. However it doesn't seem to work for some reason - all the p values I get at the end are about 0.5. Can anyone see what I'm doing wrong??
data = df[,2:6]
t.test.pvals = matrix(NA,nrow=1000,ncol=5)
ids.group1 = c(1,2,3,4,5,6)
ids.group2 = c(7,8,9,10,11,12,13)
#Define binary vector type for the t test
group1.binary <- rep(0,times=6)
group2.binary <- rep(1,times=7)
type <- c(group1.binary,group2.binary)
#Permutation testing
for (i in 1:1000) {
index = sample(1:13, size=13, replace=F)
group1 = data[which(index %in% ids.group1),]
group2 = data[which(index %in% ids.group2),]
group.total = rbind(group1,group2)
temp = t(sapply(group.total, function(x)
unlist(t.test(x~type)[c("p.value")])))
temp = as.vector(temp)
t.test.pvals[i,] = temp
}
You can either do a t-test or do permutation testing. In the permutation testing, you don't use t-tests. See for instance here for a tutorial on permutation testing. Below you find the code for your particular example (e.g. var5):
# t-test
with(df, t.test(var5~Group))$p.value
# Permutation testing
# mean difference
mean.diff <- with(df, abs(mean(var5[Group==1])-mean(var5[Group==2])))
# function that calculates resampled mean
one.test <- function(x,y) {
xstar<-sample(x)
abs(mean(y[xstar==1])-mean(y[xstar==2]))
}
# calculating the resampled means
many.diff <- c(mean.diff, with(df, replicate(1000, one.test(Group, var5))))
# pvalue
p5 <- mean(abs(many.diff) >= abs(mean.diff))
p5
The way you did it, you resampled and then calculated p-values from a t-test. After the resampling, the p-value is uniformly distributed between 0 and 1. Therefore when you look at summary(t.test.pvals), you see uniformly distributed p-values (as expected).
#shadow explained the issue with your code well. If I were you I would generally refrain from coding this kind of thing from scratch. The coin package implements all the permutation tests you could ever want to use. No need to re-invent the wheel.
This code
library(coin)
sapply(df[,-1], function(x) pvalue(oneway_test(x ~ as.factor(df$Group))))
## var1 var2 var3 var4 var5
## 0.548 0.544 0.898 0.685 0.304
does what you seem to want to do (i.e., test whether there is a shift in the distribution of varX in Group 1 versus Group 2).

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took on the value 1 5 times in the dataset, 5 12 times, etc. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split by spacing to get each
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
you could get the variance for each variable using
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots
boxplot(variables, ~.)
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
"var_b 1:7 4:9 3:11 2:10",
"var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Resources