connect two matrices by columns and extract sub matrix - r

I have two matrices (e.g., A and B). I would like to extract columns of B in the order given by A's first column.
For example
matrix A
name score
a 0.1
b 0.2
c 0.1
d 0.6
matrix B
a d b c g h
0.1 0.2 0.3 0.4 0.6 0.2
0.2 0.1 0.4 0.7 0.1 0.1
...
I want matrix B to look like this at the end
matrix B_modified
a b c d
0.1 0.3 0.4 0.2
0.2 0.4 0.7 0.1
Can this be done in either Perl or R? Thanks a lot in advance.

I've no idea what problems you're facing. Here's how I've done it.
## get data as matrix
a <- read.table(header=TRUE, text="name score
a 0.1
b 0.2
c 0.1
d 0.6", stringsAsFactors=FALSE) # load directly as characters
b <- read.table(header=TRUE, text="a d b c g h
0.1 0.2 0.3 0.4 0.6 0.2
0.2 0.1 0.4 0.7 0.1 0.1", stringsAsFactors=FALSE)
a <- as.matrix(a)
b <- as.matrix(b)
Now subset to get your final result:
b[, a[, "name"]]
# a b c d
# [1,] 0.1 0.3 0.4 0.2
# [2,] 0.2 0.4 0.7 0.1

The error:
[.data.frame(b, , a[, "name"]) : undefined columns selected
means that you are trying to select a column that is listed in a$name but is not defined in b. One solution is to use intersect with colnames(b). This also converts the factor to a character vector, and you get the right order.
b[, intersect(a[, "name"],colnames(b))] ## the order is important here
For example, I tested this with the following data:
b <- read.table(text='
a d b c
0.1 0.2 0.3 0.4
0.2 0.1 0.4 0.7',header=TRUE)
a <- read.table(text='name score
a 0.1
z 0.5
c 0.1
d 0.6',header=TRUE)
b[, intersect(a[, "name"],colnames(b))]
a c d
1 0.1 0.4 0.2
2 0.2 0.7 0.1
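The point about order can be checked on a small matrix. The sketch below uses base R only; the names a_names, keep and b_sub are made up for illustration. It relies on intersect returning its result in the order of its first argument, so the kept columns follow a's order rather than b's.

```r
# Names requested by "a"; "z" does not exist in b and should be skipped
a_names <- c("a", "z", "c", "d")

# A 2 x 4 matrix with the same column layout as b above (filled column by column)
b <- matrix(c(0.1, 0.2,  0.2, 0.1,  0.3, 0.4,  0.4, 0.7),
            nrow = 2, dimnames = list(NULL, c("a", "d", "b", "c")))

# intersect() keeps the order of its FIRST argument, so a's order wins
keep <- intersect(a_names, colnames(b))

# drop = FALSE keeps a matrix even if only one column survives
b_sub <- b[, keep, drop = FALSE]
print(b_sub)
```

Swapping the arguments, intersect(colnames(b), a_names), would return the same names in b's order instead, which is why the argument order matters here.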

If your data originates as an R data structure then it would be perverse to export it and solve this problem using Perl. However, if you have text files that look like the data you have shown, then here is a Perl solution for you.
I have split the output on spaces. That can be changed very simply if necessary.
use strict;
use warnings;
use autodie;
sub read_file {
    my ($name) = @_;
    open my $fh, '<', $name;
    my @data = map [ split ], <$fh>;
    \@data;
}
my $matrix_a = read_file('MatrixA.txt');
my @fields = map $matrix_a->[$_][0], 1 .. $#$matrix_a;
my $matrix_b = read_file('MatrixB.txt');
my @headers = @{$matrix_b->[0]};
my @indices = map {
    my $label = $_;
    grep $headers[$_] eq $label, 0..$#headers
} @fields;
for my $row (0 .. $#$matrix_b) {
    print join(' ', map $matrix_b->[$row][$_], @indices), "\n";
}
output
a b c d
0.1 0.3 0.4 0.2
0.2 0.4 0.7 0.1


How to split a heatmap at specific rows? Error: $ operator is invalid for atomic vectors

I am trying to build a heatmap in R. I want to split the heatmap at specific rows. For example, my matrix looks like this:
ID A B C
FD_1 0.3 0.2 1
FD_2 0.4 1 0.9
FD_3 0.6 0.8 0.2
FS_1 0.3 0.2 1
FS_2 0.4 1 0.9
FS_3 0.6 0.8 0.2
FS_4 0.4 1 0.9
FS_5 0.6 0.8 0.2
FE_1 0.3 0.2 1
FE_2 0.4 1 0.9
FE_3 0.6 0.8 0.2
FE_4 0.4 1 0.9
I need to make a heatmap with 3 slices: one for the 3 FD rows, one for the 5 FS rows and one for the 4 FE rows. Each slice should be labelled with its name: FD, FS or FE.
I'm using this code:
Heatmap(M_matrix, name = "level", row_split = M_matrix$ID)
But I'm getting this error:
Error in M_matrix$ID : $ operator is invalid for atomic vectors
Any suggestion?
Thanks
You can define the splits based on your ID column:
library(ComplexHeatmap)
ID=c(paste0("FD_", 1:3), paste0("FS_", 1:5), paste0("FE_", 1:4))
df <- data.frame(ID=ID,
matrix(rnorm(3*12, mean = 3), ncol=3,
dimnames=list(ID, LETTERS[1:3])),
stringsAsFactors = FALSE)
splits <- factor(gsub("_.*", "", ID))
Heatmap(matrix=as.matrix(df[,-1] ), row_split = splits, cluster_row_slices = FALSE)
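As for the error itself: `$` works on lists and data frames, but a matrix is an atomic vector with a dim attribute, so M_matrix$ID fails. A minimal sketch follows, assuming the IDs are stored as the row names of M_matrix (the question does not say so explicitly). It builds the split factor without ComplexHeatmap, with levels in order of first appearance so the slices would come out as FD, FS, FE rather than alphabetically.

```r
# Illustrative matrix: IDs as row names, arbitrary values (not the asker's data)
ids <- c(paste0("FD_", 1:3), paste0("FS_", 1:5), paste0("FE_", 1:4))
M_matrix <- matrix(seq_len(36) / 36, ncol = 3,
                   dimnames = list(ids, c("A", "B", "C")))

# Strip everything from the first underscore onwards to get the group prefix
prefix <- sub("_.*", "", rownames(M_matrix))

# Fix the level order to the order of first appearance
splits <- factor(prefix, levels = unique(prefix))
print(table(splits))

# With ComplexHeatmap loaded, the factor would then be passed as:
# Heatmap(M_matrix, name = "level", row_split = splits, cluster_row_slices = FALSE)
```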
If you want a list of data frames based on ID, we can use split.
list_df <- split(df, sub("_.*", "", df$ID))
list_df
#$FD
# ID A B C
#1 FD_1 0.3 0.2 1.0
#2 FD_2 0.4 1.0 0.9
#3 FD_3 0.6 0.8 0.2
#$FE
# ID A B C
#9 FE_1 0.3 0.2 1.0
#10 FE_2 0.4 1.0 0.9
#11 FE_3 0.6 0.8 0.2
#12 FE_4 0.4 1.0 0.9
#$FS
# ID A B C
#4 FS_1 0.3 0.2 1.0
#5 FS_2 0.4 1.0 0.9
#6 FS_3 0.6 0.8 0.2
#7 FS_4 0.4 1.0 0.9
#8 FS_5 0.6 0.8 0.2
We can also use group_split:
library(dplyr)
library(stringr)
list_df <- df %>%
group_split(grp = str_remove(ID, "_.*"), .keep = FALSE)

convert from long to symmetrical square wide format in R

I would like to convert this dataframe
tmp <- data.frame(V1=c("A","A","B"),V2=c("B","C","C"),V3=c(0.2,0.4,0.1))
tmp
V1 V2 V3
1 A B 0.2
2 A C 0.4
3 B C 0.1
into a square matrix like this (which should ultimately be a dist object):
A B C
A 0
B 0.2 0
C 0.4 0.1 0
I tried different approaches based on functions reshape, spread or xtabs but I cannot get the right dimension. Thanks for your help.
Maybe you can try the code below
d <- sort(unique(unlist(tmp[1:2])))
m <- `dimnames<-`(matrix(0,length(d),length(d)),list(d,d))
m[as.matrix(tmp[1:2])] <- tmp$V3
res <- t(m) + m
such that
> res
A B C
A 0.0 0.2 0.4
B 0.2 0.0 0.1
C 0.4 0.1 0.0
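Since the question ultimately asks for a dist object, the symmetric matrix can be handed to as.dist. The sketch below repeats the steps above in a self-contained form; stringsAsFactors = FALSE is an assumption made here so that the labels are plain character vectors on any R version.

```r
tmp <- data.frame(V1 = c("A", "A", "B"),
                  V2 = c("B", "C", "C"),
                  V3 = c(0.2, 0.4, 0.1),
                  stringsAsFactors = FALSE)

# All labels that occur in either column, sorted
d <- sort(unique(unlist(tmp[1:2])))

# Zero matrix with the labels as row and column names
m <- matrix(0, length(d), length(d), dimnames = list(d, d))

# A two-column character matrix indexes (row name, column name) pairs
m[as.matrix(tmp[1:2])] <- tmp$V3

# Mirror the filled triangle to make the matrix symmetric
res <- m + t(m)

# as.dist() reads the lower triangle of a square matrix
dd <- as.dist(res)
print(dd)
```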
You can also create your own dist object this way using structure:
tmp_lab <- unique(c(as.character(tmp$V1), as.character(tmp$V2)))
structure(tmp$V3,
Size = length(tmp_lab),
Labels = tmp_lab,
Diag = TRUE,
Upper = FALSE,
method = "user",
class = "dist")
Output
A B C
A 0.0
B 0.2 0.0
C 0.4 0.1 0.0
Here is an option with xtabs, after converting columns 'V1' and 'V2' to factors with the same levels:
tmp[1:2] <- lapply(tmp[1:2], factor, levels = c('A', 'B', 'C'))
as.dist(xtabs(V3 ~ V2 + V1, tmp), diag = TRUE)
# A B C
#A 0.0
#B 0.2 0.0
#C 0.4 0.1 0.0

Lower RAM consumption for a transformation of a transition matrix

I've written the following two functions, which take as input a transition matrix and the nodes that should become absorbing states, and transform the matrix.
The first function, set.absorbing.states(), has 3 arguments: tm is the initial transition matrix, inn is one specified initial node, and soi is the set of interest. By 'set of interest', I mean a set of nodes in that matrix that must be set as absorbing states. An example of such an initial matrix is the following:
tm <- read.table(row.names=1, header=FALSE, text="
A 0.2 0.3 0.1 0.2 0.1 0.1
B 0.3 0.1 0.1 0.2 0.2 0.1
C 0 0.2 0.4 0.1 0.2 0.1
D 0.2 0.1 0.2 0.3 0.1 0.1
E 0.2 0.2 0.1 0.2 0.1 0.2
F 0.3 0.2 0.4 0.1 0 0")
colnames(tm) <- row.names(tm)
As you can see, there are no absorbing states in that matrix. Let's say, for example, that we want to set A and E as absorbing states, with B as a randomly selected initial node.
By executing the first function, tm1 <- set.absorbing.states( tm , "B", c("A","E")), we get back a matrix in which the absorbing states have been set:
A B C D E F
A 1.0 0.0 0.0 0.0 0.0 0.0
B 0.3 0.1 0.1 0.2 0.2 0.1
C 0.0 0.2 0.4 0.1 0.2 0.1
D 0.2 0.1 0.2 0.3 0.1 0.1
E 0.0 0.0 0.0 0.0 1.0 0.0
F 0.3 0.2 0.4 0.1 0.0 0.0
As you can see, A and E have been changed into absorbing states.
The next step is to transform that matrix so that all absorbing-state nodes (both rows and columns) go to the end. So by running ptm <- transform.tm( tm1, c("A","E") ) we get back a matrix that looks like:
B C D F A E
B 0.1 0.1 0.2 0.1 0.3 0.2
C 0.2 0.4 0.1 0.1 0.0 0.2
D 0.1 0.2 0.3 0.1 0.2 0.1
F 0.2 0.4 0.1 0.0 0.3 0.0
A 0.0 0.0 0.0 0.0 1.0 0.0
E 0.0 0.0 0.0 0.0 0.0 1.0
You can now clearly see that the A and E nodes went to the end of that matrix.
Here are the functions I'm using.
set.absorbing.states <- function ( tm, inn, soi )
{
set <- which( row.names(tm) %in% soi )
set <- set[row.names(tm)[set] != inn] # compare node names (not row indices) with inn
for (i in set )
tm[i,] <- 0
for (i in set)
tm[i,i] <- 1
tm
}
transform.tm <- function ( tm, soi )
{
end_sets <- which(row.names(tm) %in% soi)
ptm <- rbind( cbind(tm[-end_sets, -end_sets], tm[-end_sets, end_sets]) , cbind(tm[end_sets, -end_sets], tm[end_sets, end_sets]) )
ptm
}
The thing now is that with such small matrices, everything works properly. But when I tried a big matrix (20,000 x 20,000), the second function needed 32 GB of RAM to execute.
So is there any way to do this in a more resource-efficient way?
Using indexing will significantly reduce the number of copies that your transformation function creates (via rbind and cbind). It is probably also a bit simpler conceptually (given a solid understanding of indexing with [).
transform.tm1 <- function ( tm, soi ) {
newOrder <- c(setdiff(row.names(tm), soi), soi)
tm[newOrder, newOrder]
}
Here, setdiff is used to pull the non-matching names and put them at the front of the vector. Then the matrix is simply reordered via its row/column names.
This returns
transform.tm1(tm1, c("A", "E"))
B C D F A E
B 0.1 0.1 0.2 0.1 0.3 0.2
C 0.2 0.4 0.1 0.1 0.0 0.2
D 0.1 0.2 0.3 0.1 0.2 0.1
F 0.2 0.4 0.1 0.0 0.3 0.0
A 0.0 0.0 0.0 0.0 1.0 0.0
E 0.0 0.0 0.0 0.0 0.0 1.0
Check that they return the same results:
identical(transform.tm(tm1, c("A", "E")), transform.tm1(tm1, c("A", "E")))
[1] TRUE
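The same name-based indexing can also be applied to the first function. The following is only a sketch: set.absorbing.states1 is a hypothetical name, and it assumes tm can be held as a numeric matrix with row and column names. It replaces the two loops with one row assignment and one assignment through a two-column index matrix; no memory measurements were made for it.

```r
set.absorbing.states1 <- function(tm, inn, soi) {
  tm <- as.matrix(tm)
  # Names of the states to make absorbing, excluding the initial node
  set <- intersect(rownames(tm), setdiff(soi, inn))
  tm[set, ] <- 0             # zero out the selected rows in one step
  tm[cbind(set, set)] <- 1   # (row name, column name) pairs address the diagonal cells
  tm
}

nodes <- c("A", "B", "C", "D", "E", "F")
tm <- matrix(c(0.2, 0.3, 0.1, 0.2, 0.1, 0.1,
               0.3, 0.1, 0.1, 0.2, 0.2, 0.1,
               0,   0.2, 0.4, 0.1, 0.2, 0.1,
               0.2, 0.1, 0.2, 0.3, 0.1, 0.1,
               0.2, 0.2, 0.1, 0.2, 0.1, 0.2,
               0.3, 0.2, 0.4, 0.1, 0,   0),
             nrow = 6, byrow = TRUE, dimnames = list(nodes, nodes))

tm1 <- set.absorbing.states1(tm, "B", c("A", "E"))
print(tm1)
```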

R: Apply function on data frame A dependent on values of data frame B

I have two data frames A and B.
A = data.frame(x = c(3,-4,2), y=c(-4,7,1), z=c(-5,-1,6))
B = data.frame(x = c(0.5,0.9,0.3), y=c(0.7,0.2,0.1), z=c(0.9,0.8,0.6))
If a value in A is negative, the corresponding value in B (at the same position as in A) should be subtracted from 1. If the value in A is positive, the corresponding value in B should not change.
In the end B should look like this
x y z
1 0.5 0.3 0.1
2 0.1 0.2 0.2
3 0.3 0.1 0.6
Does anyone have an idea how this problem can be solved?
Thanks in advance,
Christian
This seems to work: B[A<0] <- 1 - B[A<0]
x y z
1 0.5 0.3 0.1
2 0.1 0.2 0.2
3 0.3 0.1 0.6
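This works because A < 0 evaluates to a logical matrix with the same shape as A, and indexing B with that matrix selects exactly the cells where A is negative. A self-contained version, using the data from the question:

```r
A <- data.frame(x = c(3, -4, 2), y = c(-4, 7, 1), z = c(-5, -1, 6))
B <- data.frame(x = c(0.5, 0.9, 0.3), y = c(0.7, 0.2, 0.1), z = c(0.9, 0.8, 0.6))

# Logical matrix: TRUE where the value in A is negative
neg <- A < 0

# Replace only those cells of B; all other cells are left untouched
B[neg] <- 1 - B[neg]
print(B)
```

Values of exactly 0 in A are treated like positive values here, since 0 < 0 is FALSE.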

Add columns to a list of data.frames

This is a similar question to an old post of mine, "Add columns to a dataframe based on values from a list".
Now I would like to use the solution provided there, but instead of applying it to a single data.frame, I'd like to use it on a list of data.frames.
Briefly, I have a list of data.frames that looks like this:
df <- list(data.frame(A=c("a","b","c"),
B=c("1","2","1"),
C=c(0.1,0.7,0.4)),
data.frame(A=c("d","e","f"),
B=c("2","2","3"),
C=c(0.5,0.1,0.5)),
data.frame(A=c("g","h","i"),
B=c("3","1","2"),
C=c(0.2,0.1,0.5)))
And a list whose elements have names that match df$B, i.e., these values are permutations of values from df$B. Here is an example:
ll <- list('1'=c(0.1,0.1,0.4,0.2,0.1,0.4),
'2'=c(0.1,0.1,0.5,0.7,0.5,0.7),
'3'=c(0.1,0.1,0.2,0.2,0.2,0.5))
I want to create a new list of data.frames in which each data frame of df gets new columns holding values sampled from the element of ll that corresponds to that row's value of df$B.
Here is the desired output, for a better explanation:
> list.df
[[1]]
A B C P1 P2 P3 P4 P5 P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 2 0.7 0.1 0.5 0.7 0.1 0.5 0.1
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
[[2]]
A B C P1 P2 P3 P4 P5 P6
1 d 2 0.5 0.1 0.7 0.5 0.1 0.7 0.1
2 e 2 0.1 0.7 0.5 0.1 0.7 0.1 0.5
3 f 3 0.5 0.5 0.5 0.2 0.1 0.2 0.1
[[3]]
A B C P1 P2 P3 P4 P5 P6
1 g 3 0.2 0.1 0.5 0.2 0.2 0.2 0.5
2 h 1 0.1 0.2 0.1 0.4 0.2 0.2 0.4
3 i 2 0.5 0.1 0.5 0.1 0.1 0.5 0.7
The solution that I have for one single data.frame is this:
sampfun <- function(i, l) sample(l[[as.character(i)]], 10000, replace=TRUE)
list.df <- cbind(df, t(sapply(df$B, sampfun, l = ll)))
The problem is that I don't know how to implement this solution for use with a list of data.frames.
Many thanks for the help
Note: my real list of data.frames has 9,000 elements, and I am looking to add more than 10,000 columns, so memory use and speed are important.
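One possible way to extend the single-data.frame solution is to wrap it in a helper and apply it to each element with lapply. This is a sketch only, not tuned for the memory and speed concerns mentioned in the note: add_samples and the argument n are hypothetical names introduced here, n = 6 is chosen to match the desired output above, and the sampled values will differ from that output because they are random.

```r
df <- list(data.frame(A = c("a", "b", "c"), B = c("1", "2", "1"), C = c(0.1, 0.7, 0.4)),
           data.frame(A = c("d", "e", "f"), B = c("2", "2", "3"), C = c(0.5, 0.1, 0.5)),
           data.frame(A = c("g", "h", "i"), B = c("3", "1", "2"), C = c(0.2, 0.1, 0.5)))
ll <- list('1' = c(0.1, 0.1, 0.4, 0.2, 0.1, 0.4),
           '2' = c(0.1, 0.1, 0.5, 0.7, 0.5, 0.7),
           '3' = c(0.1, 0.1, 0.2, 0.2, 0.2, 0.5))

# Draw n values, with replacement, from the element of l named i
sampfun <- function(i, l, n) sample(l[[as.character(i)]], n, replace = TRUE)

# Hypothetical helper: append n sampled columns (P1..Pn) to one data frame
add_samples <- function(d, l, n) {
  draws <- vapply(as.character(d$B), sampfun, numeric(n), l = l, n = n)
  # draws holds one block of n values per row of d; lay them out row by row
  p <- matrix(draws, ncol = n, byrow = TRUE,
              dimnames = list(NULL, paste0("P", seq_len(n))))
  cbind(d, p)
}

set.seed(1)  # only to make the sketch reproducible
list.df <- lapply(df, add_samples, l = ll, n = 6)
print(list.df[[1]])
```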
