Grouping data with the same name and applying a function - R

I want to group the columns that have the same name and apply a function to the rows of my matrix. The matrix looks like this:
> data
      A  A  A  B  B  C
gene1 1  6 11 16 21 26
gene2 2  7 12 17 22 27
gene3 3  8 13 18 23 28
gene4 4  9 14 19 24 29
gene5 5 10 15 20 25 30
Basically, I want to put the columns with the same name (like A) into group 1, B into group 2, and so on, and then calculate a t-test for each gene between the different groups.
Can anybody help me with this? First the grouping, then applying the t-test, which should return a t score for each gene between the different groups.

The OP hasn't mentioned what form they want their output in, but I'm entirely updating this answer with a possible solution.
First, some reproducible sample data to work with (that will actually work with t.test).
set.seed(1)
mymat <- matrix(sample(100, 40, replace = TRUE),
                ncol = 8, dimnames = list(
                  paste("gene", 1:5, sep = ""),
                  c("A", "A", "A", "B", "B", "B", "C", "C")))
mymat
#        A  A  A   B  B  B  C  C
# gene1 27 90 21  50 94 39 49 67
# gene2 38 95 18  72 22  2 60 80
# gene3 58 67 69 100 66 39 50 11
# gene4 91 63 39  39 13 87 19 73
# gene5 21  7 77  78 27 35 83 42
I've left all the hard work to the combn function. Within combn, I've used the FUN argument to supply a function that creates a vector of the t.test "statistic" for each row (I'm assuming one gene per row). I've also added an attribute to the resulting vector to remind us which columns were used in calculating the statistic.
temp <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
  out <- vector(length = nrow(mymat))
  for (i in sequence(nrow(mymat))) {
    out[i] <- t.test(mymat[i, colnames(mymat) %in% x[1]],
                     mymat[i, colnames(mymat) %in% x[2]])$statistic
  }
  attr(out, "NAME") <- paste(x, collapse = "")
  out
}, simplify = FALSE)
The output of the above is a list of vectors. It might be more convenient to convert it into a matrix. Since we know that each value in a vector represents one row, and each vector overall represents one column-name combination (AB, AC, or BC), we can use that for the dimnames of the resulting matrix.
DimNames <- list(rownames(mymat), sapply(temp, attr, "NAME"))
final <- do.call(cbind, temp)
dimnames(final) <- DimNames
final
#               AB         AC           BC
# gene1 -0.5407966 -0.5035088  0.157386919
# gene2  0.5900350 -0.7822292 -1.645448267
# gene3 -0.2040539  1.7263502  1.438525163
# gene4  0.6825062  0.5933218  0.009627409
# gene5 -0.4384258 -0.9283003 -0.611226402
Some manual verification:
## Should be the same as final[1, "AC"]
t.test(mymat[1, colnames(mymat) %in% "A"],
       mymat[1, colnames(mymat) %in% "C"])$statistic
#          t
# -0.5035088

## Should be the same as final[5, "BC"]
t.test(mymat[5, colnames(mymat) %in% "B"],
       mymat[5, colnames(mymat) %in% "C"])$statistic
#          t
# -0.6112264

## Should be the same as final[3, "AB"]
t.test(mymat[3, colnames(mymat) %in% "A"],
       mymat[3, colnames(mymat) %in% "B"])$statistic
#          t
# -0.2040539
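As an aside, the inner for loop could also be written with apply() over the rows; a sketch of the same computation (not from the original answer), relying on the fact that apply() passes each row as a named vector:

temp2 <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
  ## each row keeps its column names, so %in% subsetting still works
  apply(mymat, 1, function(r)
    t.test(r[colnames(mymat) %in% x[1]],
           r[colnames(mymat) %in% x[2]])$statistic)
}, simplify = FALSE)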
Update
Building on @EDi's answer, here's another approach. It makes use of melt from "reshape2" to convert the data into a "long" format. From there, as before, it's pretty straightforward subsetting work to get what you want. The output is transposed relative to the pure combn approach, but the values are the same.
library(reshape2)
mymatL <- melt(mymat)
byGene <- split(mymatL, mymatL$Var1)
RowNames <- combn(unique(as.character(mymatL$Var2)), 2,
                  FUN = paste, collapse = "")
out <- sapply(byGene, function(combos) {
  combn(unique(as.character(mymatL$Var2)), 2, FUN = function(x) {
    t.test(value ~ Var2, combos[combos[, "Var2"] %in% x, ])$statistic
  }, simplify = TRUE)
})
rownames(out) <- RowNames
out
#         gene1      gene2      gene3       gene4      gene5
# AB -0.5407966  0.5900350 -0.2040539 0.682506188 -0.4384258
# AC -0.5035088 -0.7822292  1.7263502 0.593321770 -0.9283003
# BC  0.1573869 -1.6454483  1.4385252 0.009627409 -0.6112264
The first option is considerably faster, at least on this smaller dataset:
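Here fun1() and fun2() are presumed to be wrappers around the two approaches above; they aren't defined in the original answer, so the following is a minimal sketch of how they might look:

library(microbenchmark)
## Assumed benchmark wrappers: fun1() is the pure combn approach,
## fun2() the melt/split approach; both close over mymat from above
fun1 <- function() {
  temp <- combn(unique(colnames(mymat)), 2, FUN = function(x) {
    out <- vector(length = nrow(mymat))
    for (i in sequence(nrow(mymat))) {
      out[i] <- t.test(mymat[i, colnames(mymat) %in% x[1]],
                       mymat[i, colnames(mymat) %in% x[2]])$statistic
    }
    out
  }, simplify = FALSE)
  do.call(cbind, temp)
}
fun2 <- function() {
  mymatL <- melt(mymat)
  byGene <- split(mymatL, mymatL$Var1)
  sapply(byGene, function(combos) {
    combn(unique(as.character(mymatL$Var2)), 2, FUN = function(x) {
      t.test(value ~ Var2, combos[combos[, "Var2"] %in% x, ])$statistic
    })
  })
}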
microbenchmark(fun1(), fun2())
# Unit: milliseconds
#    expr       min        lq    median       uq      max neval
#  fun1()  8.812391  9.012188  9.116896  9.20795 17.55585   100
#  fun2() 42.754296 43.388652 44.263760 45.47216 67.10531   100

Related

How to compare elements of two large datasets as efficiently as possible?

I am an R amateur, learning slowly. Here is the situation:
I have two data frames, each with several columns (4) and more than 10,000 rows, both looking like this:
df1:                df2:
Nº  x   y  attr     Nº  x    y  attr
1   45  34 X        1   34   23 x
1   48  45 XX       4   123  45 x
1   41  23 X        4   99   69 xx
4   23  12 X        4   112  80 xx
4   28  16 X        5   78   80 x
5   78  80 XXX      5   69   74 xx
...
I would like to compare both data frames based on x and y (the coordinates) and delete from df1 all the rows whose coordinates also appear in df2.
So in my example, the last row of df1 would be deleted, because the same coordinates are in df2.
What I am doing now is using a double for() loop, one over each dataset, comparing all possible pairs of values one by one.
I know this is extremely inefficient, and it takes a lot of time as the amount of data increases.
What other ways are there to do this?
There are probably functions for this, but I don't really know how to use them and they give me problems.
Thank you very much!!
A library(data.table) method:
df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c('x', 'y')]
#    Nº  x  y attr
# 1:  1 45 34    X
# 2:  1 48 45   XX
# 3:  1 41 23    X
# 4:  4 23 12    X
# 5:  4 28 16    X
Not the most elegant solution, but it gets the job done:
df2 = fread('Nº x y attr
1 34 23 x
4 123 45 x
4 99 69 xx
4 112 80 xx
5 78 80 x
5 69 74 xx')
df1 = fread('Nº x y attr
1 45 34 X
1 48 45 XX
1 41 23 X
4 23 12 X
4 28 16 X
5 78 80 XXX')
> df1[!stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_"),]
Nº x y attr
1: 1 45 34 X
2: 1 48 45 XX
3: 1 41 23 X
4: 4 23 12 X
5: 4 28 16 X
Explanation:
It's best to use vectorised functions rather than loops. The expression !stringr::str_c(df1$x, df1$y, sep="_") %in% stringr::str_c(df2$x, df2$y, sep="_") concatenates the x and y columns of each data frame into a string and then flags the elements of df1 that are not in df2. This gives a logical vector of TRUE/FALSE values which we can use to subset df1.
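As an aside (not from the original answers), the same filtering can be expressed as an anti-join, which avoids building the intermediate strings; a sketch using dplyr:

library(dplyr)
# keep the rows of df1 whose (x, y) pair has no match in df2
df1_filtered <- anti_join(df1, df2, by = c("x", "y"))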
EDIT:
I was curious to see whether my answer or @dww's was faster:
> library(microbenchmark)
> n = 100000
> df1 = data.table(x = sample(n), y = sample(n))
> df2 = data.table(x = sample(n), y = sample(n))
> microbenchmark(
+   df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x, df2$y, sep = "_"), ],
+   df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c('x', 'y')]
+ )
Unit: milliseconds
expr
df1[!stringr::str_c(df1$x, df1$y, sep = "_") %in% stringr::str_c(df2$x, df2$y, sep = "_"), ]
df1[fsetdiff(df1[, .(x, y)], df2[, .(x, y)]), on = c("x", "y")]
min lq mean median uq max neval
168.40953 199.37183 219.30054 209.61414 222.08134 364.3458 100
41.07557 42.67679 52.34855 44.34379 59.27378 152.1283 100
It seems @dww's data.table version is ~5x faster.
3 lines of code
#generate sample data
x1 <- sample(1:50,9001, T)
y1 <- sample(1:50,9001, T)
x2 <- sample(1:50,9001, T)
y2 <- sample(1:50,9001, T)
df1 <- data.frame(id =1:9001, x1,y1, stringsAsFactors = F)
df2 <- data.frame(id =1:9001, x2,y2, stringsAsFactors = F)
#add a match column to each dataframe
df1$match <- paste(df1$x1, df1$y1)
df2$match <- paste(df2$x2, df2$y2)
#overwrite df1 with the rows of df1 that do not appear in df2
df1 <- df1[!df1$match %in% df2$match,]

R apply a vector of functions to a dataframe

I am currently working on a data frame with raw numeric data in the columns. Every column contains the data for one parameter (for example, gene expression data of gene xyz), while each row is a subject. Some of the columns are normally distributed, while some are far from it.
I ran Shapiro tests using apply with margin 2 for different transformations, and then picked suitable transformations by comparing shapiro.test()$p.value. I stored my picks as characters in a vector, giving me a vector of NA, log10, sqrt with the length ncol(DataFrame).
I now wonder if it is possible to apply this vector to the data frame via an apply function, or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop over if-else statements, but there has to be a more efficient way, because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
So you could do something like this. In the example below, I've cooked up four random functions whose names I've then stored in the character vector my_func_list (note: the last function converts its input to NA; that is intentional).
Then, I created another function func_to_df() that accepts the data.frame and the list of functions (func_list) as inputs, and applies (i.e., executes using get()) each function to the corresponding column of the data.frame. The output is returned (and, in this example, stored in the data.frame my_df1).
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
  if(is.numeric(x)){
    x = x + 2
  } else {
    warning("Input must be numeric!")
  }
  return(x)
  #Constraints such as the one shown above
  #can be added elsewhere to prevent
  #inappropriate action
}

#Example function 2
mymulttwo <- function(x){
  return(x * 2)
}

#Example function 3
mysqrt <- function(x){
  return(sqrt(x))
}

#Example function 4
myna <- function(x){
  return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
  matrix(sample(1:100, 40, replace = TRUE),
         nrow = 10, ncol = 4),
  stringsAsFactors = FALSE)

#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
  for(i in 1:length(func_list)){
    df[, i] <- get(func_list[i])(df[, i])
    #Alternative to get():
    #df[, i] <- eval(as.name(func_list[i]))(df[, i])
  }
  return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
#    X1 X2 X3 X4
# 1   8 85  6 41
# 2  45  7  8 65
# 3  34 80 16 89
# 4  34 62  9 31
# 5  98 47 51 99
# 6  77 28 40 72
# 7  24  7 41 46
# 8  45 80 75 30
# 9  93 25 39 72
# 10 68 64 87 47

my_df1
#     X1  X2       X3 X4
# 1   10 170 2.449490 NA
# 2   47  14 2.828427 NA
# 3   36 160 4.000000 NA
# 4   36 124 3.000000 NA
# 5  100  94 7.141428 NA
# 6   79  56 6.324555 NA
# 7   26  14 6.403124 NA
# 8   47 160 8.660254 NA
# 9   95  50 6.244998 NA
# 10  70 128 9.327379 NA
#---------------------
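A footnote on the OP's update: the reported error comes from parse(), whose first argument is a file name, so parsing a string requires the text = argument. For the original transformationVector with NA entries, here is a minimal sketch (assuming NA means "leave the column unchanged"):

exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4, 1, 10, 100, 1000, 0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
#Map pairs each column with its transformation; NA is treated as identity
TransformedExampleDF <- as.data.frame(
  Map(function(col, fn) if (is.na(fn)) col else get(fn)(col),
      exampleDF, transformationVector))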

merge tables in R, combine cells if in both

Hi, can you please explain how I can merge two tables so that they can be used to generate a pie chart?
#read input data
dat = read.csv("/ramdisk/input.csv", header = TRUE, sep="\t")
# pick the needed columns and count the occurrences of each entry
df1 = table(dat[["C1"]])
df2 = table(dat[["C2"]])
# rename columns
names(df1) <- c("ID", "a", "b", "c", "d")
names(df2) <- c("ID", "e", "f", "g", "h")
# show data for testing purpose
df1
# ID a b c d
#241 18 17 28 29
df2
# ID e f g h
#230 44 8 37 14
# looks fine so far, now the problem:
# what I want to do is merge df1 and df2
# so that the result contains the overall numbers of each entry
# it should print
# ID  a  b  c  d  e  f  g  h
# 471 18 17 28 29 44  8 37 14
# need them to make a nice piechart in the end
#pie(df)
I assume it can be done with merge somehow, but I haven't found the right way. The closest solution I found was merge(df1, df2, all = TRUE), but it wasn't exactly what I needed.
An approach would be to stack both tables, rbind them, and do an aggregate:
out <- aggregate(values ~ ., rbind(stack(df1), stack(df2)), sum)
To get a named vector
with(out, setNames(values, ind))
Or another approach is to concatenate the tables and then use tapply to do a group-by sum:
v1 <- c(df1, df2)
tapply(v1, names(v1), sum)
Or with rowsum
rowsum(v1, group = names(v1))
Another approach would be to use rbindlist from data.table and colSums to get the totals. rbindlist with fill=TRUE accepts all columns, even if they are not present in both tables.
df1 <- read.table(text = "ID a b c d
241 18 17 28 29", header = TRUE)
df2 <- read.table(text = "ID e f g h
230 44 8 37 14", header = TRUE)
library(data.table)
setDT(df1)
setDT(df2)
res <- rbindlist(list(df1,df2), use.names=TRUE, fill=TRUE)
colSums(res, na.rm=TRUE)
 ID  a  b  c  d  e  f  g  h
471 18 17 28 29 44  8 37 14
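The summed vector can then feed pie() directly, once the ID total is dropped; a quick sketch:

totals <- colSums(res, na.rm = TRUE)
pie(totals[-1])  # drop the ID entry before plotting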
I wrote the package safejoin, which handles this type of task in an intuitive way (I hope!). You just need a common id between your two tables (we'll use tibble::rowid_to_column for that); then you can merge and handle the column conflict with sum.
Using @pierre-lapointe's data:
library(tibble)
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
res <- safe_inner_join(rowid_to_column(df1),
                       rowid_to_column(df2),
                       by = "rowid",
                       conflict = sum)
res
# rowid ID a b c d e f g h
# 1 1 471 18 17 28 29 44 8 37 14
Then, for a given row (here the first and only one), you can get your pie chart by converting it to a vector with unlist and removing the irrelevant first two elements:
pie(unlist(res[1,])[-(1:2)])

Sum over an increasing number of columns of a data frame in R

I need to extract summed subsets of a data.frame row-by-row and use the output to return a new data.frame. However, I want to increase the number of columns to sum across by 4 each time. So, for example, I want to extract the 1st column by itself, then the sum of columns 2 to 6 on a row-by-row basis, then columns 7 to 15 and so on.
I have this code, which returns the sum of a constant number of columns across a data.frame (by a maximum number of trials) into a new data.frame; I just need a way to add the escalating column count.
t <- max(as.numeric(df[, c(5)]))
process.row <- function(x){
  sapply(1:t, function(i){
    return(sum(as.numeric(x[c((6 + (i-1)*5):(10 + (i-1)*5))])))
  })
}
t(apply(df, 1, process.row)) -> collated.data
I've been really struggling with a way to do this so thanks very much for any help. I couldn't find an answer to this elsewhere so apologies if I've missed something.
I was thinking you wanted to sum the rows of the selected subset of columns. If so, perhaps this will help.
# fake data
mydf <- as.data.frame(matrix(sample(45*5), nrow = 5))
mydf
# prepare matrix of start and end columns
n <- 20
i <- 1:n
ncols <- 1 + (i-1)*4
endcols <- cumsum(ncols)
startcols <- c(1, cumsum(ncols[-length(endcols)]) + 1)
mymat <- cbind(startcols, endcols)
# function to sum the rows
myfun <- function(df, m) {
  # keep only the blocks whose end column lies within the given df
  subm <- m[m[, 2] <= dim(df)[2], ]
  # sum up the selected columns of df by rows
  sapply(1:dim(subm)[1], function(j)
    rowSums(df[, subm[j, 1]:subm[j, 2], drop = FALSE]))
}
mydf
myfun(df = mydf, m = mymat)
What you are looking for is a function that gives x, the lower value of the i-th block of the series. Since block k has width 1 + 4(k - 1), the lower value is

min.val(i) = 1 + Σ_{k=1}^{i-1} (1 + 4(k - 1))

In R, the code looks like this:
# the foo part of the function
foo <- function(x) ifelse(x > 0, 1 + (x - 1) * 4, 0)
# the wrapper of the function
min.val <- function(i){
  ifelse(i == 1, 1, 1 + sum(sapply(1:(i - 1), foo)))
}
# takes only one value
min.val(1)
# [1] 1
min.val(2)
# [1] 2
min.val(3)
# [1] 7
# to calculate multiple values, use it like this
sapply(1:5, min.val)
#[1] 1 2 7 16 29
If you want to get the maximum number, you can create another function, which looks like this
max.val <- function(i) min.val(i + 1) - 1
sapply(1:5, max.val)
#[1] 1 6 15 28 45
Testing:
# creating a series to test it
series <- 1:20
min.vals <- sapply(series, min.val)
max.vals <- sapply(series, max.val)
dat <- data.frame(min = min.vals, max = max.vals)
# dat
#    min max
# 1    1   1
# 2    2   6
# 3    7  15
# 4   16  28
# 5   29  45
# 6   46  66
# 7   67  91
# 8   92 120
# 9  121 153
# 10 154 190
# 11 191 231
# 12 232 276
# 13 277 325
# 14 326 378
# 15 379 435
# 16 436 496
# 17 497 561
# 18 562 630
# 19 631 703
# 20 704 780
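Combining the two helpers gives the collated row sums the question asks for; a sketch (assuming a numeric data frame df with at least max.val(n) columns):

n <- 3
collated.data <- sapply(1:n, function(i)
  rowSums(df[, min.val(i):max.val(i), drop = FALSE]))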
Does that give you what you want?

detect outliers in a group and outliers in a single row

Car    100 200 300
Group1  34  35  34
Group1  57  67  34
Group1  68  76   6
Group2  45  23  23
I have some problems detecting outliers in my data frame. I want to detect whether a complete vector (one row) is an outlier relative to the corresponding group vectors (rows one to three) for each group. Further, I want to detect whether there is an outlier within one specific row. For this problem I found the solution below, but with this code I have to repeat everything for every single row and check the table for a TRUE. Can this be automated, e.g. by creating a matrix of all outputs so I just have to check sum(matrix == TRUE)?
The code:
x <- as.numeric(data_without[1, 1:400])
grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  while(pv < 0.05) {
    outliers <- c(outliers,
                  as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
    test <- x[!x %in% outliers]
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  return(data.frame(X = x, Outlier = (x %in% outliers)))
}
grubbs.flag(x)
       X Outlier
1 0.1157   FALSE
2 0.1152   FALSE
3 0.1163   FALSE
4 0.1165   FALSE
I've read the documentation for grubbs.test, and the default option just checks whether there is a single outlier in the given data. Therefore I consider it sufficient to run the test only once per group.
First the data is split by group, and then the test is done for each group. Only the p-value and the description are returned at the end, to see which value, if any, is the outlier; it's easy to identify, as it will be either the maximum or the minimum value.
library(outliers)
df <- t(data.frame(car = c(100, 200, 300),
                   g1 = c(34, 35, 34),
                   g1 = c(57, 67, 34),
                   g1 = c(68, 76, 6),
                   g2 = c(45, 23, 23)))
row.names(df) <- c("car", "group1", "group1", "group1", "group2")
lst <- lapply(1:length(unique(row.names(df))), function(x) {
  df[row.names(df) == unique(row.names(df))[x], ]
})
lst
[[1]]
[1] 100 200 300

[[2]]
       [,1] [,2] [,3]
group1   34   35   34
group1   57   67   34
group1   68   76    6

[[3]]
[1] 45 23 23
lapply(lst, function(x) {
  tst <- grubbs.test(x)
  c(tst$p.value, tst$alternative)
})
[[1]]
[1] "0.5"               "highest value 300 is an outlier"

[[2]]
[1] "0.244875529263511" "lowest value 6 is an outlier"

[[3]]
[1] "0"                 "highest value 45 is an outlier"
