I have a question about the Reduce function in R. I read its documentation, but I am still a bit confused. I have 5 vectors of gene names. For example:
v1 <- c("geneA","geneB",""...)
v2 <- c("geneA","geneC",""...)
v3 <- c("geneD","geneE",""...)
v4 <- c("geneA","geneE",""...)
v5 <- c("geneB","geneC",""...)
And I would like to find out which genes are present in at least two vectors. Some people have suggested:
Reduce(intersect,list(a,b,c,d,e))
I would greatly appreciate if someone could please explain to me how this statement works, because I have seen Reduce used in other scenarios.
Reduce takes a binary function and a list of data items and successively applies the function to the list elements, feeding each result back in as the first argument of the next call. For example:
Reduce(intersect,list(a,b,c))
is the same as
intersect(intersect(a,b),c)
However, I don't think that construct will help you here as it will only return those elements that are common to all vectors.
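As a minimal sketch of the equivalence described above (the three short vectors are made-up illustration data, not the asker's), folding intersect over a list should give the same result as nesting the calls by hand:

```r
# Hypothetical example vectors (not from the question)
g1 <- c("geneA", "geneB", "geneC")
g2 <- c("geneA", "geneC", "geneD")
g3 <- c("geneC", "geneA", "geneE")

# Reduce folds the binary function over the list from left to right
folded <- Reduce(intersect, list(g1, g2, g3))

# The same computation written out as nested calls
nested <- intersect(intersect(g1, g2), g3)

identical(folded, nested)
```

Only genes present in every vector survive the fold, which is why this construct cannot answer an "at least two" question on its own.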
To count the number of vectors that a gene appears in you could do the following:
vlist <- list(v1,v2,v3,v4,v5)
addmargins(table(gene=unlist(vlist), vec=rep(paste0("v",1:5),times=sapply(vlist,length))),2,list(Count=function(x) sum(x[x>0])))
vec
gene v1 v2 v3 v4 v5 Count
geneA 1 1 0 1 0 3
geneB 1 0 0 0 1 2
geneC 0 1 0 0 1 2
geneD 0 0 1 0 0 1
geneE 0 0 1 1 0 2
A nice way to see what Reduce() is doing is to run it with its argument accumulate=TRUE. When accumulate=TRUE, it will return a vector or list in which each element shows its state after processing the first n elements of the list in x. Here are a couple of examples:
Reduce(`*`, x=list(5,4,3,2), accumulate=TRUE)
# [1] 5 20 60 120
i2 <- seq(0,100,by=2)
i3 <- seq(0,100,by=3)
i5 <- seq(0,100,by=5)
Reduce(intersect, x=list(i2,i3,i5), accumulate=TRUE)
# [[1]]
# [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36
# [20] 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74
# [39] 76 78 80 82 84 86 88 90 92 94 96 98 100
#
# [[2]]
# [1] 0 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96
#
# [[3]]
# [1] 0 30 60 90
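Another small sketch of the same idea, using made-up input: with `+` as the binary function, the accumulated states are simply the running totals, so they can be checked against cumsum.

```r
# Each element is the result after folding in one more input value
running <- Reduce(`+`, 1:5, accumulate = TRUE)
running
# [1]  1  3  6 10 15

# The running totals agree with the built-in cumulative sum
all(running == cumsum(1:5))
```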
Assuming the input values given at the end of this answer, the expression
Reduce(intersect,list(a,b,c,d,e))
## character(0)
gives the genes that are present in all vectors, not the genes that are present in at least two vectors. It means:
intersect(intersect(intersect(intersect(a, b), c), d), e)
## character(0)
If we want the genes that are in at least two vectors:
L <- list(a, b, c, d, e)
u <- unlist(lapply(L, unique)) # or: Reduce(c, lapply(L, unique))
tab <- table(u)
names(tab[tab > 1])
## [1] "geneA" "geneB" "geneC" "geneE"
or
sort(unique(u[duplicated(u)]))
## [1] "geneA" "geneB" "geneC" "geneE"
Note: We used:
a <- c("geneA","geneB")
b <- c("geneA","geneC")
c <- c("geneD","geneE")
d <- c("geneA","geneE")
e <- c("geneB","geneC")
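If one prefers to keep intersect in the picture, a possible alternative (a sketch, not one of the approaches suggested above) is to intersect every pair of vectors and pool the results. The list L below repeats the sample input; the names pairs, shared and in_two_or_more are illustrative only.

```r
# Same sample input as above, stored directly in a list
L <- list(c("geneA", "geneB"), c("geneA", "geneC"), c("geneD", "geneE"),
          c("geneA", "geneE"), c("geneB", "geneC"))

# All index pairs (i, j) with i < j
pairs <- combn(seq_along(L), 2, simplify = FALSE)

# Genes shared by each pair of vectors
shared <- lapply(pairs, function(ij) intersect(L[[ij[1]]], L[[ij[2]]]))

# A gene found in any pairwise intersection occurs in at least two vectors
in_two_or_more <- sort(unique(unlist(shared)))
in_two_or_more
```

This does a quadratic number of intersections, so the table-based approach is likely the better choice for many vectors.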
I would like to find which rows of a data frame contain the values held in another vector.
Here is the code:
> a
[1] 3 5 6
> num
x y z
1 112 55 0
2 23 21 1
3 121 56 2
4 132 15 3
5 123 15 4
6 132 45 5
7 132 41 6
8 179 45 7
To find out which rows of num have a z value that appears in a, is there a function I can use, like match? I tried this code (it does not work):
for(i in 1:length(num)){
for (j in 1: length(a)){
if (num$z[i]==a[j]){
return(row(num[i]))
}
}
}
The output is a warning.
See the code below. lapply returns a list, so which_rows holds, for each column of num, the rows in which a value of a appears.
which_rows <- lapply(num, function(x) which(x %in% a))
which_rows
$x
integer(0)
$y
integer(0)
$z
[1] 4 6 7
You were almost there with the match function:
match(a, num$z)
# [1] 4 6 7
To get the full rows of num satisfying z==a:
num[match(a,num$z),]
# x y z
# 132 15 3
# 132 45 5
# 132 41 6
or
library(dplyr)
num %>% filter(z %in% a)
# x y z
# 132 15 3
# 132 45 5
# 132 41 6
To get just the y:
num$y[match(a,num$z)]
# [1] 15 45 41
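One difference between the two approaches is worth a small sketch: match returns only the first position of each value, while which(... %in% ...) returns every matching row. The data frame below rebuilds num from the question; the vector z_dup is a made-up example with a repeated value.

```r
# num and a as shown in the question
num <- data.frame(x = c(112, 23, 121, 132, 123, 132, 132, 179),
                  y = c(55, 21, 56, 15, 15, 45, 41, 45),
                  z = 0:7)
a <- c(3, 5, 6)

rows_match <- match(a, num$z)       # first position of each element of a
rows_in    <- which(num$z %in% a)   # every row whose z is found in a

# With unique z values both give rows 4, 6 and 7
rows_match
rows_in

# Hypothetical column with a repeated value: the results now differ
z_dup <- c(3, 5, 3, 6)
match(a, z_dup)        # positions 1, 2, 4 (second 3 is ignored)
which(z_dup %in% a)    # positions 1, 2, 3, 4
```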
I would like to perform a rowSums based on specific values for multiple columns (i.e. multiple conditions). I know how to rowSums based on a single condition (see example below) but can't seem to figure out multiple conditions.
# rowSums with single, global condition
set.seed(100)
df <- data.frame(a = sample(0:100,10),
b = sample(0:100,10),
c = sample(0:100,10),
d = sample(0:100,10))
print(df)
a b c d
1 31 63 54 49
2 25 88 71 92
3 54 27 53 34
4 5 39 73 93
5 45 73 40 67
6 46 64 16 85
7 77 19 97 17
8 34 33 82 59
9 50 93 51 99
10 15 100 25 11
Single Condition Works
df$ROWSUMS <- rowSums(df[,1:4] <= 50)
# And produces
a b c d ROWSUMS
1 31 63 54 49 2
2 25 88 71 92 1
3 54 27 53 34 2
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 2
8 34 33 82 59 2
9 50 93 51 99 1
10 15 100 25 11 3
Multiple Conditions Don't Work
df$ROWSUMS_Multi <- rowSums(df[,1] <= 50 | df[,2] <= 25 | df[,3] <= 75)
Error in rowSums(df[, 1] <= 50 | df[, 2] <= 25 | df[, 3] <= 75) :
'x' must be an array of at least two dimensions
Desired Output
a b c d ROWSUMS_Multi
1 31 63 54 49 2
2 25 88 71 92 2
3 54 27 53 34 1
4 5 39 73 93 2
5 45 73 40 67 2
6 46 64 16 85 2
7 77 19 97 17 1
8 34 33 82 59 1
9 50 93 51 99 2
10 15 100 25 11 2
I could just be sub-setting incorrectly, but I haven't been able to find a fix.
One problem with [ is that selecting a single column with df[, j] coerces the result from a data.frame to a vector. From ?Extract:
x[i, j, ... , drop = TRUE]
NOTE, drop is TRUE by default
and later in the documentation
drop - For matrices and arrays. If TRUE the result is coerced to the lowest possible dimension (see the examples). This only works for extracting elements, not for the replacement. See drop for further details.
To avoid that, either use drop = FALSE or simply omit the comma: df[1] returns a single-column data.frame, because an index without a comma is treated as a column index, not a row index, for a data.frame.
rowSums(df[1] <= 50 | df[2] <= 25 | df[3] <= 75)
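A minimal sketch of the dimension dropping described above, on a small made-up data frame (toy is a hypothetical name): the comma form loses its dimensions, while the other two forms keep them, which is what rowSums needs.

```r
# Hypothetical three-row data frame
toy <- data.frame(a = c(31, 25, 54), b = c(63, 88, 27))

# Comma form with the default drop = TRUE: a plain vector, no dim attribute
is.null(dim(toy[, 1]))

# List-style indexing keeps a one-column data.frame
dim(toy[1])

# So does the comma form with drop = FALSE
dim(toy[, 1, drop = FALSE])

# A comparison on the one-column data.frame is a 3 x 1 logical matrix,
# which rowSums accepts
rowSums(toy[1] <= 50)
```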
Update
Based on the expected output, the rowSums can be written as
df$ROWSUMS <- rowSums(df[1:3] <= c(50, 25, 75)[col(df[1:3])])
df$ROWSUMS
#[1] 2 2 1 2 2 2 1 1 2 2
NOTE: The earlier comment only addressed why rowSums failed; I had not checked the expected output. Here, we need to compare 3 columns against different values. When we do
df[1] <= 50
the result is a single column of TRUE/FALSE values.
When we combine comparisons with |, as in
df[1] <= 50 | df[2] <= 25
the result is still a single column of TRUE/FALSE: a row is TRUE whenever at least one of the comparisons is TRUE, so the number of conditions met is lost. The same holds however many logical comparisons we chain with |. Using + instead gives the elementwise sum:
((df[1] <= 50)+ (df[2] <= 25) + (df[3] <= 75))[,1] # note it is a matrix
Here, we can also do it with vectors, i.e. using the comma:
((df[, 1] <= 50)+ (df[, 2] <= 25) + (df[, 3] <= 75)) # vector output
The only drawback is having to repeat the + for every column. If we use rowSums instead, make sure the comparison values are replicated (with col) to the same dimensions as the subset of the data.frame. Another option is Map:
Reduce(`+`, Map(`<=`, df[1:3], c(50, 25, 75)))
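To see that these variants agree, and how they differ from chaining with |, here is a sketch on a small made-up data frame (small and limits are hypothetical names; the thresholds are the ones from the question):

```r
# Hypothetical three-row data frame and the per-column thresholds
small  <- data.frame(a = c(31, 54, 77), b = c(63, 27, 19), c = c(54, 53, 97))
limits <- c(50, 25, 75)

# rowSums with the thresholds replicated column by column via col()
by_col <- unname(rowSums(small <= limits[col(small)]))

# Explicit elementwise sum of the three logical vectors
by_plus <- (small[, 1] <= 50) + (small[, 2] <= 25) + (small[, 3] <= 75)

# Map pairs each column with its threshold, Reduce adds the results
by_map <- Reduce(`+`, Map(`<=`, small, limits))

# Chaining with | only says whether any condition holds
any_hit <- small[, 1] <= 50 | small[, 2] <= 25 | small[, 3] <= 75

by_col    # 2 1 1
any_hit   # TRUE TRUE TRUE
```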
We can also use cbind to build a matrix from the multiple conditions, using column positions or column names, and then use rowSums as usual, e.g.
> rowSums(cbind(df[,'a'] <= 50 ,df[,'b'] <= 25 ,df[,'c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
> rowSums(cbind(df['a'] <= 50 ,df['b'] <= 25 ,df['c'] <= 75), na.rm = TRUE)
[1] 2 2 1 2 2 2 1 1 2 2
Using dplyr
library(dplyr)
df %>% mutate(ROWSUMS=rowSums(cbind(.['a'] <= 50 ,.['b'] <= 25 ,.['c'] <= 75), na.rm = TRUE))
I have a data frame with distance in the first column and class in the second:
data.tab <- read.table(text = "
644 1
76 1
78 1
350 1
45 1
37 2
366 2
46 2
71 3
28 3
97 3
30 3
55 3
65 3
116 3
30 3
18 4
143 4
99 4")
I want to reshape it into a new data frame with one row per class, padding the shorter classes with zeros up to the length of the longest class. The result will be:
data.tab <- read.table(text = "
1 644 76 78 350 45 0 0 0
2 37 366 46 0 0 0 0 0
3 71 28 97 30 55 65 116 30
4 18 143 99 0 0 0 0 0")
This essentially boils down to a simple long to wide reshape
library(tidyverse)
data.tab %>%
group_by(V2) %>%
mutate(col = paste0("V", 1:n())) %>%
spread(col, V1, fill = 0) %>%
ungroup()
## A tibble: 4 x 8
# V1 V2 V3 V4 V5 V6 V7 V8
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 644 76 78 350 45 0 0 0
#2 37 366 46 0 0 0 0 0
#3 71 28 97 30 55 65 116 30
#4 18 143 99 0 0 0 0 0
Using df as name instead of data.tab:
MAX <- max(table(df$V2))
t(sapply(split(df$V1, df$V2), function(x) c(x, rep(0, MAX-length(x)))))
(The idea is to split V1 into groups defined by V2, making the vectors equal in length by adding 0's at the end when necessary, and then combining that into a single matrix. sapply does the last bit automatically but columnwise, so t is needed.)
Another way, using length<-:
U <- unstack(df) # a hack learned from G.Grothendieck's answer
U <- with(df, split(V1,V2)) # more readable version of the above
M <- max(lengths(U))
R <- t(sapply(U, "length<-", M)) # setting all lengths equal
replace(R, is.na(R), 0) # replacing NAs by zeroes
And a (rather unreadable) one-liner doing the same thing:
"[<-"(R<-t(sapply(U<-unstack(df),"length<-",max(lengths(U)))),is.na(R),0)
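The trick these versions rely on is that assigning a larger length to a vector pads it with NA. A minimal sketch on one made-up vector (x and padded are illustrative names):

```r
# Extending the length of a vector fills the new slots with NA
x <- c(37, 366, 46)
length(x) <- 5
x
# [1]  37 366  46  NA  NA

# The replacement function can also be called directly, as sapply does above
padded <- `length<-`(c(18, 143, 99), 5)

# Swap the NA padding for zeros
padded[is.na(padded)] <- 0
padded
# [1]  18 143  99   0   0
```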
1) xtabs. Using only base R, create a sequence number column within each class and then use xtabs to rearrange the data into a table. Finally, convert that to a data frame. Omit the last line of code if a table is sufficient.
data.tab2 <- transform(data.tab, seq = ave(V2, V2, FUN = seq_along))
xt <- xtabs(V1 ~ V2 + seq, data.tab2)
as.data.frame.matrix(xt)
giving:
1 2 3 4 5 6 7 8
1 644 76 78 350 45 0 0 0
2 37 366 46 0 0 0 0 0
3 71 28 97 30 55 65 116 30
4 18 143 99 0 0 0 0 0
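The key step in (1) is the within-class sequence number. A small sketch on a made-up subset of the data (toy is a hypothetical name) shows what ave produces and how xtabs fills the missing cells with 0:

```r
# Hypothetical six-row subset: distance in V1, class in V2
toy <- data.frame(V1 = c(644, 76, 37, 71, 28, 97),
                  V2 = c(1, 1, 2, 3, 3, 3))

# ave applies seq_along within each class, giving a running position
toy$seq <- ave(toy$V2, toy$V2, FUN = seq_along)
toy$seq
# [1] 1 2 1 1 2 3

# xtabs sums V1 per (class, position) cell; empty cells come out as 0
xt <- xtabs(V1 ~ V2 + seq, toy)
xt
```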
2) ts. Another base R solution is to convert the elements of each class to a ts series, which gives tt, a multivariate time series with NAs at the ends of the shorter ones. The second line of code converts those NAs to 0 and the last line converts the result to a data frame.
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[] <- ifelse(is.na(tt), 0, tt)
as.data.frame(t(tt))
3) tapply. Using data.tab2 from (1), use tapply to create the matrix mat and then convert that to a data.frame. Omit the last line of code if a matrix is sufficient.
mat <- with(data.tab2, tapply(V1, list(V2, seq), c, default = 0))
as.data.frame(mat)
Note
A comment claimed ifelse would be slower than a suggested alternative but benchmarking it showed no overall difference on the data in the question. Of course performance may not be very important here in the first place.
library(rbenchmark)
benchmark(
ifelse = {
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[] <- ifelse(is.na(tt), 0, tt)
as.data.frame(t(tt))
},
replace = {
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[is.na(tt)] <- 0
as.data.frame(t(tt))
}
)[1:4]
giving:
test replications elapsed relative
1 ifelse 100 0.25 1
2 replace 100 0.25 1
Using data.table's transpose:
library(data.table)
cbind(sort(unique(data.tab$V2)),do.call(rbind,transpose(transpose(split(data.tab$V1, data.tab$V2), 0))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 1 644 76 78 350 45 0 0 0
#[2,] 2 37 366 46 0 0 0 0 0
#[3,] 3 71 28 97 30 55 65 116 30
#[4,] 4 18 143 99 0 0 0 0 0
Here is an example of my list (I actually have > 2,000 data frames in the real one):
df1 = read.table(text = 'a b
1 66
1 999
23 89', header = TRUE)
df2 = read.table(text = 'a b
99 61
32 99
83 19', header = TRUE)
lst = list(df1, df2)
I need to create a new column for each data.frame within the list and populate each column with a specific number.
numbers = c(100, 200)
so my output should be:
> lst
[[1]]
a b new_col
1 1 66 100
2 1 999 100
3 23 89 100
[[2]]
a b new_col
1 99 61 200
2 32 99 200
3 83 19 200
With lapply I was able to create a new blank column for each data.frame:
lst = lapply(lst, cbind, new_col = '')
> lst
[[1]]
a b new_col
1 1 66
2 1 999
3 23 89
[[2]]
a b new_col
1 99 61
2 32 99
3 83 19
But I don't know how to populate the columns with my vector of numbers.
Thanks
In order to iterate both the list of data.frames and vector of numbers at the same time, use Map(). For example
Map(cbind, lst, new_col=numbers)
# [[1]]
# a b new_col
# 1 1 66 100
# 2 1 999 100
# 3 23 89 100
#
# [[2]]
# a b new_col
# 1 99 61 200
# 2 32 99 200
# 3 83 19 200
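An equivalent sketch with an explicit function, which may be easier to extend if more than one column has to be added per data frame. The list and the numbers repeat the question's example; the name out is illustrative.

```r
# Example data from the question
df1 <- data.frame(a = c(1, 1, 23),   b = c(66, 999, 89))
df2 <- data.frame(a = c(99, 32, 83), b = c(61, 99, 19))
lst <- list(df1, df2)
numbers <- c(100, 200)

# Map walks over lst and numbers in parallel; d is one data frame,
# n is the number that belongs to it
out <- Map(function(d, n) {
  d$new_col <- n   # a single value is recycled down the column
  d
}, lst, numbers)

out[[2]]
```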
I have a dataset (x) that contains
1 10
20 30
34 38
59 83
...
I have a big n x 1 matrix. I want to assign the value 1 to every row that falls inside one of the ranges given by the rows of x. For example
mat[1:10,1] = 1
mat[20:30,1] = 1
etc...
x is quite big, so the following loop takes a while in R:
for ( j in 1:dim(x)[1] ) {
mat[x[j,1]:x[j,2], 1] <- 1
}
Please help me if there is a faster way to do this. Thanks.
You can easily make a list of the rows you want to assign a value of 1 to in your big matrix, using apply on x with seq.int to get the row numbers like this...
rows <- unlist( apply( x , 1 , FUN = function(x){ seq.int(x[1],x[2])}) )
rows
# [1] 1 2 3 4 5 6 7 8 9 10 20 21 22 23 24 25 26 27 28 29 30 34 35 36 37 38 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
And then use subsetting which will be much faster, like this
mat[ rows , 1 ] <- 1
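If even building the full index vector is too slow, one possible alternative (a sketch using a different technique, a running count of open ranges, not the approach above) avoids expanding the ranges altogether. The matrix size n = 100 is an assumption made for the example; x repeats the sample ranges.

```r
# Sample ranges from the question, one (start, stop) pair per row
x <- matrix(c(1, 10, 20, 30, 34, 38, 59, 83), ncol = 2, byrow = TRUE)

n   <- 100                              # assumed number of rows in mat
mat <- matrix(0, nrow = n, ncol = 1)

# Count how many ranges open at each row and how many have closed by it
starts <- tabulate(x[, 1],     nbins = n + 1)
stops  <- tabulate(x[, 2] + 1, nbins = n + 1)

# The running total is positive exactly on rows inside at least one range
covered <- cumsum(starts - stops)[seq_len(n)] > 0

mat[covered, 1] <- 1
sum(mat)
```

The work here is proportional to the number of ranges plus the number of rows, regardless of how long or overlapping the ranges are.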
If m is your set of start and stop locations:
m <- matrix(scan(), ncol=2, byrow=TRUE) # byrow so each input line becomes one (start, stop) row
#------
1: 1 10
3: 20 30
5: 34 38
7: 59 83
9:
Read 8 items
mapply( seq.int, m[,1], m[,2])
mat[ unlist( mapply( seq.int, m[,1], m[,2]) ), 1] <- 1
(Trivially different than SimonO101's earlier contribution.)
data.table usually excels in cases like this. Here is a data.table-based solution:
library(data.table)
indexes<-data.table(istart=c(1L,20L,34L,59L), istop=c(10L,30L,38L,83L))
mat<-data.table(val=sample(1L:1e5L,1e5))
mat[indexes[,list(i=seq(istart,istop)),by="istart"][,i],val:=1L]