Extracting matrix rows conditional on their rowsum? - r

In a matrix, how do I determine the rows that have the largest rowsums? For example, in the following matrix:
- A P S T
- 1 0 0 0 0
A 0 0 0 0 1
C 0 0 0 1 0
P 0 2 0 2 0
S 0 0 0 23 3
T 0 0 1 0 0
rows S & P have the two largest rowsums.

There's no need to use the names; you could simply do:
> Rsum <- rowSums(mat)
> mat[tail(order(Rsum),2),]
- A P S T
P 0 2 0 2 0
S 0 0 0 23 3

You could do this:
# Build your example matrix
mat = matrix( data=c( 1,0,0,0,0, 0,0,0,0,1, 0,0,0,1,0, 0,2,0,2,0, 0,0,0,23,3, 0,0,1,0,0 ), ncol=5, byrow=TRUE )
rownames( mat ) = c( '-', 'A', 'C', 'P', 'S', 'T' )
colnames( mat ) = c( '-', 'A', 'P', 'S', 'T' )
# Get the sums
sums = rowSums( mat )
# Get the top 2 row names
top.names = names( sums[ order( sums, decreasing=TRUE ) ][ 1:2 ] )
# Filter the original matrix to include just these two
mat[ rownames( mat ) %in% top.names, ]
Which outputs
- A P S T
P 0 2 0 2 0
S 0 0 0 23 3
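If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: top_rows() and its argument n are names made up here, not part of base R, and ties at the cutoff are simply broken by row order.

```r
# Hypothetical helper: return the n rows of a matrix with the largest row sums.
top_rows <- function(mat, n = 2) {
  sums <- rowSums(mat)
  # order() with decreasing = TRUE puts the largest sums first
  keep <- order(sums, decreasing = TRUE)[seq_len(min(n, nrow(mat)))]
  # drop = FALSE keeps the result a matrix even when n is 1
  mat[keep, , drop = FALSE]
}

mat <- matrix(c(1,0,0,0,0,  0,0,0,0,1,  0,0,0,1,0,
                0,2,0,2,0,  0,0,0,23,3, 0,0,1,0,0),
              ncol = 5, byrow = TRUE,
              dimnames = list(c("-", "A", "C", "P", "S", "T"),
                              c("-", "A", "P", "S", "T")))
res <- top_rows(mat, 2)
res
```

Note that the rows come back sorted by decreasing row sum (S, then P here), whereas filtering with %in% keeps them in their original order.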

Paste your matrix in an easily reproducible format for others checking your question:
m <- structure(list(X. = c(1L, 0L, 0L, 0L, 0L, 0L), A = c(0L, 0L,
0L, 2L, 0L, 0L), P = c(0L, 0L, 0L, 0L, 0L, 1L), S = c(0L, 0L,
1L, 2L, 23L, 0L), T = c(0L, 1L, 0L, 0L, 3L, 0L)), .Names = c("X.",
"A", "P", "S", "T"), class = "data.frame", row.names = c("-",
"A", "C", "P", "S", "T"))
You could get it with dput, e.g.: dput(YourMatrix). This could be useful for your future questions :)
Back to the question: sort the rowSums and get the names, via:
nms <- names(sort(rowSums(m)))
sort() is ascending, so the two largest are the last two:
> nms[(length(nms)-1):length(nms)]
[1] "P" "S"
Or get the wanted rows by:
> m[nms[(length(nms)-1):length(nms)],]
X. A P S T
P 0 2 0 2 0
S 0 0 0 23 3

Related

Replacing a subset of table values with a list of vectors

I have a table as follows:
tableA <- structure(c(1L, 0L, 0L, 0L, 0L, 6L, 0L, 6L, 0L, 3L, 0L, 0L, 0L, 0L, 1L), dim = c(3L,
5L), dimnames = structure(list(c("X", "Y",
"Z"), c("A", "B", "C","D", "E")), names = c("", "")), class = "table")
A B C D E
X 1 0 0 3 0
Y 0 0 6 0 0
Z 0 6 0 0 1
And a list of vectors as follows:
listB <- list(
"X" = c(0, 0, 0, 4, 0),
"Y" = c(0, 0, 5, 0, 0),
"Z" = c(0, 7, 0, 0, 0))
I would like to replace all values in table A, of columns B,C, and D, that are bigger than zero, with the corresponding values of list B.
Desired output:
A B C D E
X 1 0 0 4 0
Y 0 0 5 0 0
Z 0 7 0 0 1
Is there any way to copy these values in such manner?
You could do the following
cols2replace = match(c('B', 'C', 'D'), colnames(tableA))
cells2replace = tableA[, cols2replace] > 0
tableB = matrix(unlist(listB), nrow = 3, byrow = TRUE)
tableA[, cols2replace][cells2replace] = tableB[, cols2replace][cells2replace]
> tableA
A B C D E
X 1 0 0 4 0
Y 0 0 5 0 0
Z 0 7 0 0 1
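A variant of the same idea keeps the row and column names on the replacement matrix, so cells are picked by name rather than by position. This is a sketch only: repl, cols and mask are names introduced here, and it assumes every element of listB holds one value per column of tableA, in the same column order.

```r
tableA <- structure(c(1L,0L,0L, 0L,0L,6L, 0L,6L,0L, 3L,0L,0L, 0L,0L,1L),
                    dim = c(3L, 5L),
                    dimnames = list(c("X","Y","Z"), c("A","B","C","D","E")),
                    class = "table")
listB <- list(X = c(0,0,0,4,0), Y = c(0,0,5,0,0), Z = c(0,7,0,0,0))

# Stack the list into a matrix; rbind keeps the list names as row names
repl <- do.call(rbind, listB)
colnames(repl) <- colnames(tableA)

cols <- c("B", "C", "D")
# Logical mask of the cells to overwrite (only within the chosen columns)
mask <- tableA[, cols] > 0
tableA[, cols][mask] <- repl[rownames(tableA), cols][mask]
tableA
```

Indexing repl by rownames(tableA) means the result does not depend on the order of the elements in listB.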
As @Chris rightly pointed out, the prior logic using as.data.frame.matrix() couldn't use the column index vector; adjusted below:
Base R solution:
dfA <- as.data.frame.matrix(tableA)
ir <- data.frame(
Map(function(x, y){ifelse(x > 0, y, x)},
dfA,
setNames(
data.frame(t(as.data.frame(listB))),
colnames(tableA)
)
),
row.names = row.names(tableA)
)
col_idx <- !grepl(paste0(c('^B$', '^C$', '^D$'), collapse = "|"), colnames(ir))
ir[,col_idx] <- dfA[,col_idx]

Count Number of Pairwise Differences of a Matrix in R

I have the following matrix:
0 1 0 0 0 1 0 0 # Row A
0 1 0 0 0 0 1 0 # Row B
0 1 0 0 0 0 0 0 # Row C
0 0 1 0 0 0 0 0 # Row D
I want to make a new matrix that shows the pairwise difference between each row (e.g between rows A and B, there are 2 columns that are different, so the entry in the matrix corresponding to A and B is 2). Like this:
A B C D
A - 2 1 3
B - - 1 3
C - - - 2
D - - - -
The matrix isn't absolutely necessary. It's just an intermediary step for what I really want to do: count the number of pairwise differences between each row in the original matrix like so...
(2+1+3+1+3+2) = 12
You could try combn
v1 <- combn(1:nrow(m1), 2, FUN=function(x) sum(m1[x[1],]!= m1[x[2],]))
v1
#[1] 2 1 3 1 3 2
sum(v1)
#[1] 12
If you need a matrix output
m2 <- outer(1:nrow(m1), 1:nrow(m1), FUN=Vectorize(function(x,y)
sum(m1[x,]!=m1[y,])))
dimnames(m2) <- rep(list(LETTERS[1:4]),2)
m2[lower.tri(m2)] <- 0
m2
# A B C D
#A 0 2 1 3
#B 0 0 1 3
#C 0 0 0 2
#D 0 0 0 0
data
m1 <- structure(c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L), .Dim = c(4L, 8L))
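For a 0/1 matrix like this one, the number of differing columns between two rows equals their Manhattan distance, so dist() from the stats package gives the same numbers. This sketch assumes the entries are strictly 0 or 1; for other values the Manhattan distance measures the size of the differences, not their count.

```r
m1 <- rbind(A = c(0,1,0,0,0,1,0,0),
            B = c(0,1,0,0,0,0,1,0),
            C = c(0,1,0,0,0,0,0,0),
            D = c(0,0,1,0,0,0,0,0))

# Pairwise Manhattan distances between rows (each pair stored once)
d <- dist(m1, method = "manhattan")
as.matrix(d)   # full symmetric matrix of pairwise differences
sum(d)         # total over all pairs
#[1] 12
```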
I think that this function could help you to count the differences
count.diff <- function(mat) {
Nrow <- nrow(mat)
count <- 0
for (i in 1:(Nrow-1)) count <- count + sum(t(t(mat[-(1:i),])!=mat[i,]))
count
}
mat <- matrix(rbinom(n=24,size=1,prob=0.7), ncol=4)
mat
count.diff(mat)

How to extract value of a column based on multiple other columns

I have a dataframe which looks like this:
>head(df)
chrom pos strand ref alt A_pos A_neg C_pos C_neg G_pos G_neg T_pos T_neg
chr1 2283161 - G A 3 1 2 0 0 0 0 0
chr1 2283161 - G A 3 1 2 0 0 0 0 0
chr1 2283313 - G C 0 0 0 0 0 0 0 0
chr1 2283313 - G C 0 0 0 0 0 0 0 0
chr1 2283896 - G A 0 0 0 0 0 0 0 0
chr1 2283896 + G A 0 0 0 0 0 0 0 0
I want to extract the value from columns 6:13 (A_pos...T_neg) based on the value of the columns 'strand', 'ref' and 'alt'. For instance, in row1: strand = '-', ref = 'G' and alt = 'A', so I should extract the values from G_neg and A_neg. Again, in row6: strand = '+', ref = 'G' and alt = 'A', so I should get the values from G_pos and A_pos. I basically intend to do a chi-square test after extracting these values (these are my observed values; I have another set of expected values), but that is another story.
So the logic is somewhat like:
if(df$strand=="+")
do
print:paste(df$ref,"pos",sep="_") #extract value in column df$ref_pos
print:paste(df$alt,"pos",sep="_") #extract value in column df$alt_pos
else if(df$strand=="-")
do
print:paste(df$ref,"neg",sep="_") #extract value in column df$ref_neg
print:paste(df$alt,"neg",sep="_") #extract value in column df$alt_neg
Here, I am trying to use paste on the values in 'ref' and 'alt' to get the desired column names. For instance, if strand ='+' and ref = 'G', it will fetch value from column G_pos.
The data frame is actually large and so I ruled out using for-loops. I am not sure how else can I do this to make the code as efficient as possible. Any help/suggestions would be appreciated.
Thanks!
Another alternative that looks valid, at least with the sample data:
tmp = ifelse(as.character(DF$strand) == "-", "neg", "pos")
sapply(DF[c("ref", "alt")],
function(x) as.integer(DF[cbind(seq_len(nrow(DF)),
match(paste(x, tmp, sep = "_"), names(DF)))]))
# ref alt
#[1,] 0 1
#[2,] 0 1
#[3,] 0 0
#[4,] 0 0
#[5,] 0 0
#[6,] 0 0
Where DF:
DF = structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"),
pos = c(2283161L, 2283161L, 2283313L, 2283313L, 2283896L,
2283896L), strand = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("-",
"+"), class = "factor"), ref = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "G", class = "factor"), alt = structure(c(1L,
1L, 2L, 2L, 1L, 1L), .Label = c("A", "C"), class = "factor"),
A_pos = c(3L, 3L, 0L, 0L, 0L, 0L), A_neg = c(1L, 1L, 0L,
0L, 0L, 0L), C_pos = c(2L, 2L, 0L, 0L, 0L, 0L), C_neg = c(0L,
0L, 0L, 0L, 0L, 0L), G_pos = c(0L, 0L, 0L, 0L, 0L, 0L), G_neg = c(0L,
0L, 0L, 0L, 0L, 0L), T_pos = c(0L, 0L, 0L, 0L, 0L, 0L), T_neg = c(0L,
0L, 0L, 0L, 0L, 0L)), .Names = c("chrom", "pos", "strand",
"ref", "alt", "A_pos", "A_neg", "C_pos", "C_neg", "G_pos", "G_neg",
"T_pos", "T_neg"), class = "data.frame", row.names = c(NA, -6L
))
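The key step in the answer above is indexing with a two-column matrix of (row, column) positions, which pulls one cell per row in a single vectorized call. Here is a stripped-down sketch on toy data; the toy data frame and its values are invented for illustration and are not taken from the question.

```r
toy <- data.frame(strand = c("-", "+"),
                  ref    = c("G", "A"),
                  A_pos = c(3L, 5L), A_neg = c(1L, 0L),
                  G_pos = c(0L, 7L), G_neg = c(2L, 4L),
                  stringsAsFactors = FALSE)

# Numeric part only, so matrix indexing returns integers rather than strings
counts <- as.matrix(toy[c("A_pos", "A_neg", "G_pos", "G_neg")])

# Build the target column name for every row, then look up its position
suffix  <- ifelse(toy$strand == "-", "neg", "pos")
col_idx <- match(paste(toy$ref, suffix, sep = "_"), colnames(counts))

# One (row, column) pair per row of the data
ref_count <- counts[cbind(seq_len(nrow(toy)), col_idx)]
ref_count
#[1] 2 5
```

Indexing the whole data frame this way, as the answer does, returns character values because the columns have mixed types, which is why it wraps the result in as.integer().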
Not very elegant, but does the job:
strand.map <- c("-"="_neg", "+"="_pos")
cbind(
df[1:5],
do.call(
rbind,
lapply(
split(df[-(1:2)], 1:nrow(df)),
function(x)
c(
ref=x[-(1:2)][, paste0(x[[2]], strand.map[x[[1]]])],
alt=x[-(1:2)][, paste0(x[[3]], strand.map[x[[1]]])]
) ) ) )
We cycle through each row in your data frame and apply a function that pulls the value based on strand, ref, and alt. This produces:
chrom pos strand ref alt ref alt
1 chr1 2283161 - G A 0 1
2 chr1 2283161 - G A 0 1
3 chr1 2283313 - G C 0 0
4 chr1 2283313 - G C 0 0
5 chr1 2283896 - G A 0 0
6 chr1 2283896 + G A 0 0
An alternate approach is to use melt, but the format of your data makes it rather annoying because we need two melts in a row, and we need to create a unique id column so we can reconstitute the data frame once we're done computing.
library(reshape2) # provides melt() and dcast()
df$id <- 1:nrow(df)
df.mlt <-
melt(
melt(df, id.vars=c("id", "chrom", "pos", "strand", "ref", "alt")),
measure.vars=c("ref", "alt"), value.name="base",
variable.name="alt_or_ref"
)
dcast(
subset(df.mlt, paste0(base, strand.map[strand]) == variable),
id + chrom + pos + strand ~ alt_or_ref,
value.var="value"
)
Which produces:
id chrom pos strand ref alt
1 1 chr1 2283161 - 0 1
2 2 chr1 2283161 - 0 1
3 3 chr1 2283313 - 0 0
4 4 chr1 2283313 - 0 0
5 5 chr1 2283896 - 0 0
6 6 chr1 2283896 + 0 0
Another way
testFunc <- function(x){
posneg <- if(x["strand"] == "-") {"neg"} else {"pos"}
cbind(as.numeric(x[paste0(x["ref"],"_",posneg)]), as.numeric(x[paste0(x["alt"],"_",posneg)]))
}
temp <- t(apply(df, 1, testFunc))
colnames(temp) <- c("ref", "alt")
Using the [very] fast data.table library:
library(data.table)
df = fread('df.txt') # fastread
df[,ref := ifelse(strand == "-",
paste(ref,"neg",sep = "_"),
paste(ref,"pos",sep = "_"))]
df[,alt := ifelse(strand == "-",
paste(alt,"neg",sep = "_"),
paste(alt,"pos",sep = "_"))]
df[,strand := NULL] # not required anymore
dfm = melt(df,
id.vars = c("chrom","pos","ref","alt"),
variable.name = "mycol", value.name = "value")
dfm[mycol == ref | mycol == alt,] # matching

R: Combine rows in same data.frame [duplicate]

I have a simple R problem, but I just can't find the answer.
I have a dataframe like this:
A 1 0 0 0 0 0
B 0 1 0 0 0 0
B 0 0 1 0 0 1
B 0 0 0 0 1 0
C 1 0 0 0 0 0
C 0 0 0 1 1 0
And I want it to look like this:
A 1 0 0 0 0 0
B 0 1 1 0 1 1
C 1 0 0 1 1 0
Thank you very much!
Regards Lisanne
Here's one possibility using tapply:
cbind(unique(dat[1]), do.call(rbind, tapply(dat[-1], dat[[1]], colSums)))
# V1 V2 V3 V4 V5 V6 V7
# 1 A 1 0 0 0 0 0
# 2 B 0 1 1 0 1 1
# 5 C 1 0 0 1 1 0
where dat is the name of your data frame.
dat <- structure(list(V1 = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), V2 = c(1L, 0L, 0L, 0L, 1L, 0L),
V3 = c(0L, 1L, 0L, 0L, 0L, 0L), V4 = c(0L, 0L, 1L, 0L, 0L,
0L), V5 = c(0L, 0L, 0L, 0L, 0L, 1L), V6 = c(0L, 0L, 0L, 1L,
0L, 1L), V7 = c(0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), class = "data.frame", row.names = c(NA,
-6L))
You could...
aggregate(.~ V1 , data =dat, sum)
or
library(plyr)
ddply(dat, .(V1), function(x) colSums(x[,2:7]) )
If you're working with a data.frame where there are duplicates but you only want the presence or absence of a 1 to be noted, then after these functions you might want to cap the sums at 1, with something like res[-1][res[-1] > 1] <- 1, where res is the aggregated result.
A possibility not mentioned is the aggregate function. I think this is quite 'readable'.
aggregate(cbind(data$X1, data$X2, data$X3, data$X4),
by = list(category = data$group), FUN = sum)
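Base R also has rowsum(), which sums rows by group directly. Below is a sketch on the same six-row example, followed by a step that caps the sums at 1 in case only presence or absence matters. The names res and presence are introduced here for illustration.

```r
dat <- data.frame(V1 = c("A","B","B","B","C","C"),
                  V2 = c(1L,0L,0L,0L,1L,0L),
                  V3 = c(0L,1L,0L,0L,0L,0L),
                  V4 = c(0L,0L,1L,0L,0L,0L),
                  V5 = c(0L,0L,0L,0L,0L,1L),
                  V6 = c(0L,0L,0L,1L,0L,1L),
                  V7 = c(0L,0L,1L,0L,0L,0L))

# Sum every numeric column within each level of V1;
# the group labels become the row names of the result
res <- rowsum(dat[-1], group = dat$V1)
res

# Presence/absence: any positive sum becomes 1
presence <- (res > 0) * 1L
presence
```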

How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1 A 1 1 0 0
2 B 1 0 1 0
3 C 1 0 1 0
4 D 1 0 1 0
5 E 1 0 0 1
6 F 1 0 0 1
7 G 1 0 0 1
8 H 1 0 0 1
9 I 0 1 1 0
10 J 0 1 1 0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
x freq
1 HALOPERIDOL/QUETIAPINE 2
2 OLANZAPINE/HALOPERIDOL 1
3 OLANZAPINE/QUETIAPINE 3
4 OLANZAPINE/RISPERIDONE 4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
wc<-which(currow==1)
paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
as.data.frame(
with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
responseName="count"
),
count>0
)
which gives
OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4 1 1 0 0 1
6 1 0 1 0 3
7 0 1 1 0 2
10 1 0 0 1 4
It's not exactly the format you want, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
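To get the two-column Combination/Count layout from the question in base R, one option is to name the table dimension and the response column. This is a sketch along the lines of the apply() answer above; meds, combo and out are names introduced here.

```r
MultPsychMeds <- data.frame(
  ID          = LETTERS[1:10],
  OLANZAPINE  = c(1L,1L,1L,1L,1L,1L,1L,1L,0L,0L),
  HALOPERIDOL = c(1L,0L,0L,0L,0L,0L,0L,0L,1L,1L),
  QUETIAPINE  = c(0L,1L,1L,1L,0L,0L,0L,0L,1L,1L),
  RISPERIDONE = c(0L,0L,0L,0L,1L,1L,1L,1L,0L,0L))

# Logical matrix: TRUE where the patient takes the drug
meds <- as.matrix(MultPsychMeds[-1]) == 1

# One "DRUG/DRUG" label per patient, in column order
combo <- apply(meds, 1, function(r) paste(colnames(meds)[r], collapse = "/"))

# Tabulate the labels; the named argument becomes the column header
out <- as.data.frame(table(Combination = combo), responseName = "Count")
out
```

The rows come out sorted alphabetically by combination, not in the order shown in the question.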
