I'm trying to build a pairwise table and keep a p-value for each pair.
Please note that I'm still a beginner in R.
My data looks like this (though much bigger):
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
Where b1,b2,b3 should be compared to c1,c2,c3 for each row (for each ID in column a).
The end result should be something like:
a <- cbind(a,Adjusted_P_Values)
Where the head(a,1) would look like:
head(a,1)
a b1 b2 b3 c1 c2
1 ID1 0.1337694 0.7347543 0.5808391 0.4324976 0.5378458
c3 Adjusted_P_value
1 0.6368778 0.99
where each row has its corresponding P-value.
A function I have found which I think could do the trick is pairwise.t.test.
(Currently, I'm just running a loop over the rows, doing a normal t-test for each one, and then correcting the p-values with p.adjust, but that way I can't use a pooled SD, which I would like.)
So my issue now is how to structure the data so that R accepts it. I can use melt.data.frame from the reshape2 library, but it won't give me the correct structure.
I use it like this:
Test_Data <- melt(df, "a", c("b1","b2","b3","c1","c2","c3"))
But I lose the row symmetry.
As, when I now do pairwise.t.test I have to use either the "a" column or the "variable" column created by melt, hence I either get a comparison between the replicates or between the IDs.
So, simply my question is:
how do I structure the data so that each row is tested and I get a p-value for each row, and where each treatment (b or c) has a standard deviation based on all the rows (one sd for all b's and one for all c's)?
I have been googling a lot, looking for similar problems (and tutorials on pairwise.t.test), but without success.
My approach was slightly different from the other answer: I spread the data into two columns, b and c, by time measure (1 - 3), and then used t.test(...,paired=TRUE) to conduct a paired t-test for each ID.
set.seed(1234)
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
library(tidyr)
library(dplyr)
df %>%
gather(.,key="variable",value="value",-a) %>%
extract(.,variable,into = c("measure", "time"),
regex = "([A-Za-z]+)([0-9]+)") %>%
spread(.,measure,value) -> spreadData
# split by ID to conduct paired t-tests by ID
dataList <- split(spreadData,spreadData$a)
pValues <- unlist(lapply(dataList,function(x){
t.test(x$b,x$c,paired=TRUE)$p.value
}))
df$p.value <- pValues
df
...and the output:
> df
a b1 b2 b3 c1 c2
1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855
2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755
3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817
4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505
5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026
c3 p.value
1 0.4560915 0.3391364
2 0.2651867 0.5043753
3 0.3046722 0.4598274
4 0.5073069 0.6764142
5 0.1810962 0.1178471
>
NOTE: if one modifies the code from the other answer to include paired=TRUE argument, the p-values across the two solutions match.
Alternative approach: run t-test on difference between c and b
Given the commentary on this post about pairwise t-tests, I thought I'd illustrate what's happening in a paired test. Essentially, for each time period 1 - 3, we subtract the b value from the c value and run a one-sample t-test on the differences. Since we've reduced the data to a single column, there's no need for the paired= argument, but the test produces the same results as passing 2 columns with the paired=TRUE argument to t.test().
# alternative 2: subtract b from c and use regular t-test
# to show how the paired test works
spreadData$difference <- spreadData$c - spreadData$b
dataList <- split(spreadData,spreadData$a)
pValues <- unlist(lapply(dataList,function(x){
t.test(x$difference)$p.value
}))
df$p.value <- pValues
df
...and the output:
> spreadData$difference <- spreadData$c - spreadData$b
> dataList <- split(spreadData,spreadData$a)
> pValues <- unlist(lapply(dataList,function(x){
+ t.test(x$difference)$p.value
+ }))
> df$p.value <- pValues
> df
a b1 b2 b3 c1 c2
1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855
2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755
3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817
4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505
5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026
c3 p.value
1 0.4560915 0.3391364
2 0.2651867 0.5043753
3 0.3046722 0.4598274
4 0.5073069 0.6764142
5 0.1810962 0.1178471
>
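As a quick sanity check of the equivalence described above, here is a minimal sketch on made-up vectors (x and y are hypothetical and are not the data from the question). It only compares the two p-values:

```r
# Hypothetical toy data: three paired measurements
set.seed(1)
x <- runif(3)
y <- runif(3)

# Paired t-test on the two vectors
p_paired <- t.test(x, y, paired = TRUE)$p.value

# One-sample t-test on the element-wise differences
p_diff <- t.test(y - x)$p.value

# The two p-values agree up to floating-point tolerance
isTRUE(all.equal(p_paired, p_diff))
```

The sign of the difference (y - x versus x - y) does not matter for the two-sided p-value.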
A possible solution using the tidyverse package.
First, adjust the format of the data frame to the following structure.
library(tidyverse)
df2 <- df %>%
gather(Column, Value, -a) %>%
extract(Column, into = c("Group", "Number"), regex = "([A-Za-z]+)([0-9]+)")
df2
# a Group Number Value
# 1 ID1 b 1 0.640310605
# 2 ID2 b 1 0.009495756
# 3 ID3 b 1 0.232550506
# 4 ID4 b 1 0.666083758
# 5 ID5 b 1 0.514251141
# 6 ID1 b 2 0.693591292
# 7 ID2 b 2 0.544974836
# 8 ID3 b 2 0.282733584
# 9 ID4 b 2 0.923433484
# 10 ID5 b 2 0.292315840
# 11 ID1 b 3 0.837295628
# 12 ID2 b 3 0.286223285
# 13 ID3 b 3 0.266820780
# 14 ID4 b 3 0.186722790
# 15 ID5 b 3 0.232225911
# 16 ID1 c 1 0.316612455
# 17 ID2 c 1 0.302693371
# 18 ID3 c 1 0.159046003
# 19 ID4 c 1 0.039995918
# 20 ID5 c 1 0.218799541
# 21 ID1 c 2 0.810598552
# 22 ID2 c 2 0.525697547
# 23 ID3 c 2 0.914658166
# 24 ID4 c 2 0.831345047
# 25 ID5 c 2 0.045770263
# 26 ID1 c 3 0.456091482
# 27 ID2 c 3 0.265186672
# 28 ID3 c 3 0.304672203
# 29 ID4 c 3 0.507306870
# 30 ID5 c 3 0.181096208
Second, split the data frame and conduct pairwise.t.test, and then extract the P values.
p_value <- df2 %>%
split(.$a) %>%
map(function(x) pairwise.t.test(x$Value, x$Group, paired = TRUE)) %>%
map_dbl("p.value")
p_value
# ID1 ID2 ID3 ID4 ID5
# 0.3391364 0.5043753 0.4598274 0.6764142 0.1178471
Finally, add the P values to the original data frame as a new column.
df_final <- df %>% mutate(Adjusted_P_value = p_value)
df_final
# a b1 b2 b3 c1 c2 c3 Adjusted_P_value
# 1 ID1 0.640310605 0.6935913 0.8372956 0.31661245 0.81059855 0.4560915 0.3391364
# 2 ID2 0.009495756 0.5449748 0.2862233 0.30269337 0.52569755 0.2651867 0.5043753
# 3 ID3 0.232550506 0.2827336 0.2668208 0.15904600 0.91465817 0.3046722 0.4598274
# 4 ID4 0.666083758 0.9234335 0.1867228 0.03999592 0.83134505 0.5073069 0.6764142
# 5 ID5 0.514251141 0.2923158 0.2322259 0.21879954 0.04577026 0.1810962 0.1178471
DATA
set.seed(1234)
a <- factor(c("ID1","ID2","ID3","ID4","ID5"))
b <- runif(5)
b1 <- runif(5)
b2 <- runif(5)
b3 <- runif(5)
c1 <- runif(5)
c2 <- runif(5)
c3 <- runif(5)
df <- data.frame(a,b1,b2,b3,c1,c2,c3)
Edit:
In order to correctly map the P-values back onto the data frame, the data frame has to be ordered on the 'a' column.
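One way to sidestep the ordering concern is to compute the p-values row by row, so that each one stays aligned with its row. The sketch below is only an illustration under a few assumptions: the column vectors b_cols and c_cols are names I made up, and the "BH" method for p.adjust is an arbitrary choice (the question mentions p.adjust but not a method). It also does not use a pooled SD.

```r
# Rebuild a small data frame of the same shape as in the question
set.seed(1234)
df <- data.frame(a  = factor(paste0("ID", 1:5)),
                 b1 = runif(5), b2 = runif(5), b3 = runif(5),
                 c1 = runif(5), c2 = runif(5), c3 = runif(5))

b_cols <- c("b1", "b2", "b3")  # hypothetical helper: treatment b columns
c_cols <- c("c1", "c2", "c3")  # hypothetical helper: treatment c columns

# One paired t-test per row; results come back in row order
p_raw <- vapply(seq_len(nrow(df)), function(i) {
  t.test(unlist(df[i, b_cols]), unlist(df[i, c_cols]), paired = TRUE)$p.value
}, numeric(1))

df$p.value <- p_raw
# Multiple-testing correction; "BH" is an assumed choice of method
df$Adjusted_P_value <- p.adjust(p_raw, method = "BH")
df
```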
Just to add to Baraliuh's solution:
map_dbl("p.value") did not work for me; however,
map_df("p.value") worked in my case.
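If neither purrr helper behaves on your setup, one base R option is to pull the single entry out of the p.value matrix directly. This is a sketch under the assumption that there are exactly two groups, so that the matrix returned by pairwise.t.test is 1 x 1. The data below is randomly generated for illustration only.

```r
# Illustrative long-format data: 5 IDs, 2 groups, 3 replicates each
set.seed(1234)
df2 <- data.frame(a     = rep(paste0("ID", 1:5), 6),
                  Group = rep(c("b", "c"), each = 15),
                  Value = runif(30))

# One pairwise.t.test per ID
tests <- lapply(split(df2, df2$a), function(x) {
  pairwise.t.test(x$Value, x$Group, paired = TRUE)
})

# With two groups, p.value is a 1 x 1 matrix; take its only entry
p_value <- vapply(tests, function(r) r$p.value[1, 1], numeric(1))
p_value
```

The result is a named numeric vector, with names taken from the ID column.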
Related
I have a data frame with two columns "A" and "B". I created a function that works as mentioned below:
If X (a user-entered value) is found in column A, then return the X value found in column A and its corresponding value in column B.
Here's my code:
myfunction <- function(x) {
r<- with(my_dataframe, my_dataframe[A %in% x, c("A", "B")])
return(data.frame(r))
}
I want to tweak this in such a way that if user input (value for X) doesn't appear in column A, return that value and NA for column B.
Example:
A B
1 A12
2 F1222
If the values for X are 1 and 5, I want the output to look like this:
1 A12
5 NA
One approach is to first find the matched rows with the condition matched = my_dataframe$A %in% x.
If any matched rows are found, use matched to return the corresponding rows. Otherwise, create a row with an NA value for B.
myfunction <- function(x) {
r <- data.frame()
matched = my_dataframe$A %in% x
if(sum(matched) > 0){
r<- with(my_dataframe, my_dataframe[matched, c("A", "B")])
} else{
r<-data.frame(A = x, B = NA)
}
return(r)
}
#Test
myfunction(2)
# A B
# 2 2 A34
myfunction(11)
# A B
# 1 11 NA
Edit: based on the latest feedback from the OP, I think dplyr::left_join will do the trick:
a <- 1
dplyr::left_join(data.frame(A=a), my_dataframe, by="A")
# A B
# 1 1 A21
a <- c(2,3,12,34,45)
dplyr::left_join(data.frame(A=a), my_dataframe, by="A")
# A B
# 1 2 A34
# 2 3 D345
# 3 12 <NA>
# 4 34 <NA>
# 5 45 <NA>
Data
my_dataframe <- data.frame(A = 1:4,
B=c("A21", "A34", "D345", "E45"), stringsAsFactors = FALSE)
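For readers who prefer to stay in base R, the same lookup can be sketched with match(), which returns NA for values that are not found. The function name lookup is hypothetical and is not part of the answer above.

```r
my_dataframe <- data.frame(A = 1:4,
                           B = c("A21", "A34", "D345", "E45"),
                           stringsAsFactors = FALSE)

# Hypothetical helper: one output row per requested value, NA where A has no match
lookup <- function(x, data = my_dataframe) {
  data.frame(A = x,
             B = data$B[match(x, data$A)],
             stringsAsFactors = FALSE)
}

lookup(c(2, 3, 12))
```

Because match() works element by element, the output keeps the order of the requested values and handles a mix of found and missing values.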
myfunction <- function(x) {
r<- with(my_dataframe, my_dataframe[A %in% x, c("A", "B")])
if(!nrow(r)) data.frame(A=x,B=NA) else data.frame(r)
}
> myfunction(3)
A B
1 3 NA
> myfunction(2)
A B
2 2 F1222
edit to allow vectors:
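my <- function(x) {
  # `data` is the data frame from the question (columns A and B)
  s <- subset(data, A %in% x)  # %in% handles vector input; == would recycle
  m <- x %in% s$A              # which requested values were found
  if (all(m)) s else rbind(s, cbind(A = x[!m], B = NA))
}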
my(1)
A B
1 1 A12
> my(1:10)
A B
1 1 A12
2 2 F1222
3 3 <NA>
4 4 <NA>
5 5 <NA>
6 6 <NA>
7 7 <NA>
8 8 <NA>
9 9 <NA>
10 10 <NA>
> my(4)
A B
1 4 NA
my(c(1,3.11))
A B
1 1.00 A12
2 3.11 <NA>
I'm new to this, but I'm pretty sure this question hasn't been answered, or I'm just not good at searching.
I would like to subtract the values in one particular row from multiple other rows, based on matching columns and values. My actual data will be a large matrix with >5000 columns, each needing to have a blank value subtracted, where the blank is chosen by matching a value in a factor column.
Here is an example data table:
c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb
I would like to subtract the c2, c3, and c4 values of the c1 = "Blank" rows from rows A, B, C, and D, using the c5 factor to define which Blank values are used (aa or bb). The "Blank" values should be subtracted from all rows that share the same c5 value.
(I know this is confusing to describe.)
So the results would look like this:
c1 c2 c3 c4 c5
r1 A -1 -1 -1 aa
r2 B -1 -1 -1 bb
r3 C 1 1 1 aa
r4 D 1 -3 1 bb
I've seen the ddply function work for doing something like this with a single column, but I wasn't able to expand that to perform this task for multiple columns. I'm a noob though...
Thank you for your help!
This is not tested for all possible cases, but should give you an idea:
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = TRUE)
library(data.table)
# separate dataset into two
dt <- data.table(df, key = "c5")
dt.blank <- dt[c1 == "Blank"]
dt <- dt[c1 != "Blank"]
# merge into resulting dataset
dt.res <- dt[dt.blank]
# update each column
columns.count <- ncol(dt)
for(i in 2:(columns.count-1)) {
dt.res[[i]] <- dt.res[[i]] - dt.res[[i + columns.count]]
}
# > dt.res
# c1 c2 c3 c4 c5 i.c1 i.c2 i.c3 i.c4
# 1: A -1 -1 -1 aa Blank 2 3 4
# 2: C 1 1 1 aa Blank 2 3 4
# 3: B -1 -1 -1 bb Blank 3 4 5
# 4: D 1 -3 1 bb Blank 3 4 5
First split your data, since there's no reason to keep the samples and the blanks in a single data structure. Then apply the subtraction:
# recreate your data
df <- data.frame(rbind(c(1:3, "aa"), c(2:4, "bb"), c(3:5, "aa"), c(4,1,6, "bb"), c(2:4, "aa"), c(3:5, "bb")))
df[,1:3] <- apply(df[,1:3], 2, as.integer)
# split it
blank1 <- df[5,]
blank2 <- df[6,]
df <- df[1:4,]
for (i in 1:nrow(df)) {
if (df[i,4] == "aa") {df[i,1:3] <- df[i,1:3] - blank1[1:3]}
else {df[i,1:3] <- df[i,1:3] - blank2[1:3]}
}
There are a few different ways to run the loop, including vectorizing it, but this suffices. I'd also argue that there's no reason to keep the labels "aa" vs. "bb" in the initial data structure either, which would make this simpler; but it's your choice.
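As one possible vectorized variant of the loop, the blank row for each sample can be looked up with match() on c5 and subtracted in one step. This is only a sketch on the example table from the question; the names num_cols, blanks, samples, and idx are mine, and it assumes exactly one Blank row per c5 level.

```r
df <- read.table(text =
"c1 c2 c3 c4 c5
r1 A 1 2 3 aa
r2 B 2 3 4 bb
r3 C 3 4 5 aa
r4 D 4 1 6 bb
r5 Blank 2 3 4 aa
r6 Blank 3 4 5 bb", header = TRUE)

num_cols <- c("c2", "c3", "c4")          # columns to correct

blanks  <- df[df$c1 == "Blank", ]        # one blank row per c5 level (assumed)
samples <- df[df$c1 != "Blank", ]

# For each sample row, find the blank row with the same c5 value
idx <- match(samples$c5, blanks$c5)

# Subtract the matching blank values from every numeric column at once
samples[num_cols] <- samples[num_cols] - blanks[idx, num_cols]
samples
```

With many columns, num_cols could instead be built programmatically, for example from the names of the numeric columns.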
What I want to do is multiply all the values in column 1 of a data.frame by the first element in a vector, then multiply all the values in column 2 by the 2nd element in the vector, etc...
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
c1 c2 c3
1 1 4 7
2 2 5 8
3 3 6 9
v1 <- c(1,2,3)
So the result is this:
c1 c2 c3
1 1 8 21
2 2 10 24
3 3 12 27
I can do this one column at a time but what if I have 100 columns? I want to be able to do this programmatically.
Or simply turn the vector into a diagonal matrix, so that each column is multiplied by the corresponding element in v1:
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- as.matrix(cbind(c1,c2,c3))
v1 <- c(1,2,3)
d1%*%diag(v1)
[,1] [,2] [,3]
[1,] 1 8 21
[2,] 2 10 24
[3,] 3 12 27
Transposing the dataframe works.
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
v1 <- c(1,2,3)
t(t(d1)*v1)
# c1 c2 c3
#[1,] 1 8 21
#[2,] 2 10 24
#[3,] 3 12 27
EDIT: If not all columns are numeric, you can do the following
c1 <- c(1,2,3)
c2 <- c(4,5,6)
c3 <- c(7,8,9)
d1 <- data.frame(c1,c2,c3)
# Adding a column of characters for demonstration
d1$c4 <- c("rr", "t", "s")
v1 <- c(1,2,3)
#Choosing only numeric columns
index <- which(sapply(d1, is.numeric) == TRUE)
d1_mat <- as.matrix(d1[,index])
d1[,index] <- t(t(d1_mat)*v1)
d1
# c1 c2 c3 c4
#1 1 8 21 rr
#2 2 10 24 t
#3 3 12 27 s
We can also replicate the vector to make the lengths equal and then multiply
d1*v1[col(d1)]
# c1 c2 c3
#1 1 8 21
#2 2 10 24
#3 3 12 27
Or use sweep
sweep(d1, 2, v1, FUN="*")
Or with mapply to multiply the corresponding columns of 'data.frame' and elements of 'vector'
mapply(`*`, d1, v1)
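To compare these options side by side, here is a small sketch on the data from the question. Note that mapply() returns a matrix; wrapping Map() in as.data.frame() is one way to get a data frame back (that wrapper is my addition and is not part of the answer above).

```r
d1 <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
v1 <- c(1, 2, 3)

r_col   <- d1 * v1[col(d1)]                 # replicate v1 by column index
r_sweep <- sweep(d1, 2, v1, FUN = "*")      # sweep v1 across columns (margin 2)
r_map   <- as.data.frame(Map(`*`, d1, v1))  # column-wise, kept as a data frame

# All three give the same values
all(as.matrix(r_col) == as.matrix(r_sweep))
all(as.matrix(r_col) == as.matrix(r_map))
```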
Shown as below:
df <- data.frame(X1 = rep(letters[1:3],3),
X2 = 1:9,
X3 = sample(1:50,9))
df
ind<- grep("a|c", df$X1)
library(data.table)
df_ac <- df[ind,]
df_b <- df[!ind,]
df_ac is created using the regular grep command. Now I want to use grep the reverse way: to select all observations with X1 == 'b'.
I know I can do this by:
ind2<- grep("a|c", df$X1, invert = T)
df_b <-df[ind2,]
But, in my original script, why does the command df_b <-df[!ind,] return a data frame with zero observations?
Can anyone explain why my logic here is wrong? Is there any other way to select observations in a data.frame by using grep in reverse without specifying invert = T? Thank you!
You may be more interested in grepl instead of grep:
ind<- grepl("a|c", df$X1)
df[ind,]
# X1 X2 X3
# 1 a 1 16
# 3 c 3 38
# 4 a 4 10
# 6 c 6 18
# 7 a 7 33
# 9 c 9 49
df[!ind,]
# X1 X2 X3
# 2 b 2 5
# 5 b 5 14
# 8 b 8 50
Alternatively, go ahead and make use of "data.table" and try out %in% to see what else might work for you. Notice the difference in the syntax.
ind2 <- c("a", "c")
library(data.table)
setDT(df)
df[X1 %in% ind2]
# X1 X2 X3
# 1: a 1 16
# 2: c 3 38
# 3: a 4 10
# 4: c 6 18
# 5: a 7 33
# 6: c 9 49
df[!X1 %in% ind2]
# X1 X2 X3
# 1: b 2 5
# 2: b 5 14
# 3: b 8 50
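On the original question of why df[!ind,] comes back empty: grep() returns integer positions, not a logical vector. Applying ! to non-zero integers turns every one of them into FALSE, so no rows are selected. With integer positions, a negative index is the usual way to drop rows. A minimal sketch (the X3 column is left out because it was random in the question):

```r
df <- data.frame(X1 = rep(letters[1:3], 3), X2 = 1:9)

ind <- grep("a|c", df$X1)  # integer positions of the matching rows
ind

!ind                       # every non-zero integer becomes FALSE

nrow(df[!ind, ])           # all-FALSE index selects no rows

df[-ind, ]                 # negative positions drop the matching rows
```

One caveat: if grep() finds no match, ind is integer(0) and df[-ind, ] returns zero rows rather than all rows, which is one reason the logical grepl() version is often the safer choice.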
I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
j <- m1[i,"value"]
k <- 2
while ( k <= j) {
m2 <- rbind(m2,m1[i,])
k = k+1
}
}
> m2 <- subset(m2,select = - value)
> m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the entries according to value? Or is there any other library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use table with as.data.frame to create 'dM'
dM <- as.data.frame(as.table(m0))
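Putting that last variant together end to end, a sketch might look like the following. Note that as.data.frame() on a table names the count column Freq rather than value, so the replication step uses dM$Freq.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format: columns Var1, Var2 and the counts in Freq
dM <- as.data.frame(as.table(m0))

# Repeat each row Freq times and keep only the two label columns
res <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
row.names(res) <- NULL
res

# Cross-tabulating the result recovers the original counts
table(res$Var1, res$Var2)
```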