reordering rows of matrix by data sequence inside [duplicate] - r

Q. I have an Erdős-Rényi graph. I infect a vertex and want to see what sequence of vertices the disease would follow. igraph has helpful functions like get.adjacency() and neighbors().
Details. This is the adjacency matrix with vertex names instead of 0/1 flags (row i lists the vertices that vertex i links to), and I'm trying to get the contagion chain out of it, that is, the flow or sequence of an epidemic through the graph once a certain vertex is infected. Let's not worry about infection probabilities here (assume all vertices hit are infected with probability 1).
So suppose I hit vertex 1 (which is row 1 here). We see that it has outgoing links to vertices 4, 5, 18, 22, 23, 24 and 25. The next vertices will then be those connected to 4, 5, 18, ..., 25, i.e. the values in row 4, row 5, row 18, ..., row 25. Then, according to the model, the disease will travel through these, and so forth.
I understand that I can pass a string to order the matrix rows. My problem is, I cannot figure out how to generate that sequence.
The matrix looks like this.
> channel
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 4 5 18 22 23 24 25 NA
[2,] 6 10 11 18 25 NA NA NA
[3,] 7 11 18 20 NA NA NA NA
[4,] 24 NA NA NA NA NA NA NA
[5,] 1 3 9 13 14 NA NA NA
[6,] 3 8 9 14 19 23 NA NA
[7,] 3 4 8 15 20 22 NA NA
[8,] 2 3 25 NA NA NA NA NA
[9,] 3 4 11 13 20 NA NA NA
[10,] 4 5 8 15 19 20 21 22
[11,] 3 13 15 18 19 23 NA NA
[12,] 11 13 16 NA NA NA NA NA
[13,] 4 6 14 15 16 17 19 21
[14,] 2 6 13 NA NA NA NA NA
[15,] 3 17 20 NA NA NA NA NA
[16,] 6 15 18 23 NA NA NA NA
[17,] 2 25 NA NA NA NA NA NA
[18,] 2 5 NA NA NA NA NA NA
[19,] 3 11 NA NA NA NA NA NA
[20,] 1 4 7 10 12 21 22 25
[21,] 2 4 6 13 14 16 18 NA
[22,] 1 3 4 15 23 NA NA NA
[23,] 1 16 24 NA NA NA NA NA
[24,] 7 8 19 20 22 NA NA NA
[25,] 7 12 13 17 NA NA NA NA
I want to reorder this matrix based on a selection criterion as follows:
R would be most helpful (but I'm interested in the algorithm, so Python, Ruby, etc. would also be great). The resulting vector will have a length of 115 (8 x 25 = 200 entries, minus 85 NAs). It would look like this, which is basically how the disease would spread if vertex 1 becomes infected:
4,5,18,22,23,24,25,24,1,3,9,13,14,2,5,1,3,4,15,23,1,16,24,7,8,19,20,22,7,12,13,17,7,8,19,20,22, 4,5,18,22,23,24,25,7,11,18,20...
What I know so far:
1. R has a package **igraph** which lets me calculate neighbors(graph, vertex, "out")
2. The same package can also generate get.adjlist(graph...), get.adjacency

Finding a "contagion chain" like this is equivalent to a breadth-first search through the graph, e.g.:
library(igraph)
set.seed(50)
g = erdos.renyi.game(20, 0.1)
plot(g)
order = graph.bfs(g, root=14, order=TRUE, unreachable=FALSE)$order
Output:
> order
[1] 14 1 2 11 16 18 4 19 12 17 20 7 8 15 5 13 9 NaN NaN NaN
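If you would rather work directly on a neighbor matrix like channel, the same breadth-first idea can be sketched in base R. This is only a sketch under assumptions: bfs_order and the small adj matrix are made up for illustration, and each vertex is reported once (the first time it is reached), which differs from the repeated listing shown in the question.

```r
# Hypothetical helper: breadth-first visiting order over a neighbor matrix
# where row i holds the out-neighbors of vertex i, padded with NA.
bfs_order <- function(adj, root) {
  visited <- rep(FALSE, nrow(adj))
  visited[root] <- TRUE
  queue <- root
  visit_order <- c()
  while (length(queue) > 0) {
    v <- queue[1]                 # take the oldest vertex in the queue
    queue <- queue[-1]
    visit_order <- c(visit_order, v)
    nb <- adj[v, ]
    nb <- unique(nb[!is.na(nb)])  # drop the NA padding
    nb <- nb[!visited[nb]]        # keep only vertices not reached yet
    visited[nb] <- TRUE
    queue <- c(queue, nb)
  }
  visit_order
}

# Made-up 4-vertex example in the same layout as channel
adj <- rbind(c(2, 3, NA),
             c(4, NA, NA),
             c(4, 1, NA),
             c(1, NA, NA))
bfs_order(adj, 1)
bfs_order(adj, 2)
```

Vertices that cannot be reached from the root simply never appear in the result.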

It's not clear how you define the ordering of the rows, so... just a few hints:
You can select a permutation/combination of rows by passing an index vector:
> (m <- matrix(data=1:9, nrow=3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> m[c(2,3,1),]
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 1 4 7
The function t() transposes a matrix.
The matrix is stored in columns-first (or column-major) order:
> as.vector(m)
[1] 1 2 3 4 5 6 7 8 9
NA values can be removed by subsetting:
> qq <- c(1,2,NA,5,7,NA,3,NA,NA)
> qq[!is.na(qq)]
[1] 1 2 5 7 3
Also, graph algorithms are provided by Bioconductor's graph or CRAN's igraph packages.
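Putting these hints together, a possible way to read the rows of a neighbor matrix in a chosen order and flatten them into one vector is sketched below. The small channel matrix and row_order are made-up stand-ins, not the data from the question.

```r
# Made-up 3-vertex neighbor matrix, padded with NA like the one in the question
channel <- rbind(c(2, 3, NA),
                 c(3, NA, NA),
                 c(1, 2, NA))

# Hypothetical visiting order; a row may appear more than once
row_order <- c(1, 2, 3, 3)

# Select the rows, transpose so that as.vector() walks row by row
# (R stores matrices column-major), then drop the NA padding
flat <- as.vector(t(channel[row_order, , drop = FALSE]))
flat <- flat[!is.na(flat)]
flat
```

The drop = FALSE keeps the result a matrix even when row_order has a single element, so t() still behaves as expected.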

Related

Nested for loops and nested if...else statements very slow

I am fairly new to R. I wrote a piece of code that iterates over a data frame cell by cell and performs a 'simple' check with if...else statements, but it turns out to be very slow.
Here is the code:
for(m in 1:(ncol(df) - 1)){
  listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
  for(n in 1:nrow(df)){
    if (is.na(df[n, m]) == TRUE){
      df[n, m] <- NA
    } else {
      if (sum(as.numeric(unlist(strsplit(as.character(df[n, 33]), split = ";"))) %in% listmatch, na.rm = TRUE) > 0){
        df[n, m] <- 1
      } else {
        df[n, m] <- NA
      }
    }
    next
  }
  next
}
Just to give some dimensions: df has 80,000 observations of 33 variables, listmatch is on average a vector of 5,000 values, and column 33 of df contains information such as
1;2;3;4;5;6;7;8;9;10;11;12;13...
with on average 150 values separated by semicolons.
I do not understand why it is so slow, nor how to optimize it and make it faster.
Is it the nested for loops or the nested if...else statements that make it slow?
Any suggestions?
Here is an example of my data:
df
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,33]
[1,] NA NA NA NA NA 1 NA 1;8;9;6
[2,] 1 NA NA NA NA NA NA 5
[3,] 1 NA NA 1 NA NA NA 6
[4,] 1 NA 1 NA 1 1 1 88;98;125
[5,] NA NA NA NA NA NA NA 2;3
[6,] 1 NA NA NA NA NA NA 4;5
[7,] NA NA NA NA NA NA 1 44
[8,] NA NA 1 1 1 NA NA 46
[9,] NA 1 1 NA NA NA NA 78
[10,] 11 NA NA NA NA NA NA 147;411
[11,] NA 1 NA NA NA NA NA 400
[12,] NA 1 1 NA NA NA NA 658
[13,] NA 1 NA 1 NA NA NA 87;988;1226
[14,] NA NA NA NA 1 1 NA 121;232;321
[15,] NA NA 1 NA NA 1 NA 5
while dt looks like this
Text Experiment Number id
[1,] df 5 3 1
[2,] fd 8 8 8
[3,] reh 7 4 6
[4,] egfdgsd 4 8 8
[5,] fbdsgryj 5 3 44
[6,] rtryyukfn 7 2 411
[7,] fgsfde 5 6 400
[8,] wthrthmgh 2 7 1274
[9,] vdfd 2 8 658
[10,] vfr 8 6 147
[11,] brtey 3 2 1226
[12,] tyirt 7 5 988
[13,] fyi 5 4 87
[14,] ywrtj 7 5 46
[15,] kyou 8 6 4
[16,] pkjh 1 7 8
[17,] retuyk 5 4 9
[18,] re 7 8 11
[19,] eryre 6 3 12
[20,] fdhd 6 1 18
I forgot to mention that I would like to keep the code
listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
as it is. At the moment I am filtering on the value of m, but in the future I would like to apply different filters manually, i.e. ...Experiment == m*2 & Number == m*5)...
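One likely cost in the loop above is that strsplit runs again for every cell, so the same string in column 33 is split once per column. A possible sketch that splits once and then works a whole column at a time is shown below. The tiny df and dt here are made-up stand-ins (with the id string in the last column instead of column 33), and the listmatch line is kept as in the question.

```r
# Toy stand-ins for df and dt (hypothetical values, far smaller than the real data)
df <- data.frame(
  V1  = c(1, NA, 1),
  V2  = c(NA, 1, 1),
  ids = c("1;8;9", "5", "44;6"),
  stringsAsFactors = FALSE
)
dt <- data.frame(
  Experiment = c(1, 1, 2),
  Number     = c(1, 1, 2),
  id         = c(8, 44, 5)
)

# Split the semicolon-separated ids once, instead of once per cell
id_list <- lapply(strsplit(df$ids, ";", fixed = TRUE), as.numeric)

for (m in seq_len(ncol(df) - 1)) {
  # Filter kept exactly as in the question
  listmatch <- unlist(subset(dt, Experiment == m & Number == m)[[which(colnames(dt) == "id")]])
  # One logical per row: does any id of that row appear in listmatch?
  hit <- vapply(id_list, function(v) any(v %in% listmatch), logical(1))
  # Keep 1 only where the cell was not NA and there is a match, NA otherwise
  df[[m]] <- ifelse(!is.na(df[[m]]) & hit, 1, NA)
}
df
```

The inner loop over rows disappears because ifelse works on the whole column at once; how much this helps on the real data would need to be measured.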

R - =if max statement from excel to R

Within a data frame, I want to compare today's value against a look-back period of n days.
I know how to do it in Excel, for example to see whether today's value is higher than the previous 10 days:
=IF(A11>MAX(A1:A10),1,0)
How can I do the same logic within a function in R?
The output would look like this below:
Column Output
1 12 NA
2 13 NA
3 14 NA
4 15 NA
5 9 NA
6 9 NA
7 7 NA
8 8 NA
9 16 NA
10 17 NA
11 20 1
12 14 0
13 9 0
14 8 0
15 6 0
16 5 0
17 28 1
In row 11, the value 20 is higher than every value of the previous 10 days, so it gets a 1.
In row 12, the value 14 is not higher than every value of the previous 10 days, so it gets a 0.
And of course the 10-day window keeps rolling forward.
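For reference, the Excel formula translates almost literally into a base R loop. This is only a minimal sketch: x holds the Column values from the table above, and n and out are names chosen here for illustration.

```r
# Values of Column from the example above
x <- c(12, 13, 14, 15, 9, 9, 7, 8, 16, 17, 20, 14, 9, 8, 6, 5, 28)
n <- 10  # length of the look-back window

out <- rep(NA_real_, length(x))
for (i in seq_along(x)) {
  if (i > n) {
    # Same test as =IF(A11>MAX(A1:A10),1,0), shifted along the vector
    out[i] <- as.numeric(x[i] > max(x[(i - n):(i - 1)]))
  }
}
out
```

The first n entries stay NA because there is no complete look-back window for them.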
P Lapointe's answer is great but anytime I'm doing a 'rolling' calculation my first instinct is to think of rollapply from the zoo package.
library(zoo)  # provides rollapply

# The values from the question
dat <- c(12, 13, 14, 15, 9, 9, 7, 8, 16, 17, 20, 14, 9, 8, 6, 5, 28)

is_last_greatest <- function(x){
  #' Given an input vector this returns
  #' 1 if the last element is greater than
  #' all of the other elements and 0 otherwise
  ifelse(all(tail(x,1) > head(x,-1)), 1, 0)
}
# We want to compare to the previous 10 values but the function
# I wrote requires us to include the value we're using as
# comparison so I set the width to 11
output <- rollapply(dat,
                    width = 11,
                    FUN = is_last_greatest,
                    fill = NA,
                    align = "right")
cbind(dat, output)
which gives
dat output
[1,] 12 NA
[2,] 13 NA
[3,] 14 NA
[4,] 15 NA
[5,] 9 NA
[6,] 9 NA
[7,] 7 NA
[8,] 8 NA
[9,] 16 NA
[10,] 17 NA
[11,] 20 1
[12,] 14 0
[13,] 9 0
[14,] 8 0
[15,] 6 0
[16,] 5 0
[17,] 28 1
Here's how to do that with roll_maxr from RcppRoll.
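library(RcppRoll)
library(dplyr)  # lag() must be dplyr::lag here; stats::lag does not shift a plain vector
df$Output2 <- ifelse(df$Column > roll_maxr(dplyr::lag(df$Column), 11, na.rm = TRUE), 1, 0)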
Column Output Output2
1 12 NA NA
2 13 NA NA
3 14 NA NA
4 15 NA NA
5 9 NA NA
6 9 NA NA
7 7 NA NA
8 8 NA NA
9 16 NA NA
10 17 NA NA
11 20 1 1
12 14 0 0
13 9 0 0
14 8 0 0
15 6 0 0
16 5 0 0
17 28 1 1
data
df <- read.table(text=" Column Output
1 12 NA
2 13 NA
3 14 NA
4 15 NA
5 9 NA
6 9 NA
7 7 NA
8 8 NA
9 16 NA
10 17 NA
11 20 1
12 14 0
13 9 0
14 8 0
15 6 0
16 5 0
17 28 1",header=TRUE,stringsAsFactors=FALSE)
Here is a base R method using embed to construct the comparison sets and performing the comparisons with apply.
# get a matrix for comparisons
myMat <- embed(df$Column, 11)
Here, this returns
myMat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 20 17 16 8 7 9 9 15 14 13 12
[2,] 14 20 17 16 8 7 9 9 15 14 13
[3,] 9 14 20 17 16 8 7 9 9 15 14
[4,] 8 9 14 20 17 16 8 7 9 9 15
[5,] 6 8 9 14 20 17 16 8 7 9 9
[6,] 5 6 8 9 14 20 17 16 8 7 9
[7,] 28 5 6 8 9 14 20 17 16 8 7
So the goal is to compare the value in the first column with those in the remaining columns for each row.
as.integer(max.col(myMat) == 1L)
[1] 1 0 0 0 0 0 1
Now, tack on the appropriate number of NA values, which is the number of columns in myMat minus one.
df$output2 <- c(rep(NA, ncol(myMat) - 1), as.integer(max.col(myMat) == 1L))
this returns
df
Column Output output2
1 12 NA NA
2 13 NA NA
3 14 NA NA
4 15 NA NA
5 9 NA NA
6 9 NA NA
7 7 NA NA
8 8 NA NA
9 16 NA NA
10 17 NA NA
11 20 1 1
12 14 0 0
13 9 0 0
14 8 0 0
15 6 0 0
16 5 0 0
17 28 1 1
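An advantage of max.col is that it is quite fast. One of its biggest drawbacks is that it does not have an na.rm argument to remove missing values. Note also that max.col breaks ties at random by default (ties.method = "random"), so pass ties.method = "first" or "last" if the maximum can be tied. In the case that there are missing values, here is a method using apply on myMat instead of max.col.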
apply(myMat, 1, function(x) as.integer(all(head(x, 1) > tail(x, -1))))
The operating comparison function here is
all(head(x, 1) > tail(x, -1))
Functions that produce the same result when the maximum is unique include the following
head(x, 1) == max(x) # or
x[1] == max(x)
and
1L == which.max(x)
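These forms only agree when the maximum is unique. A quick check on a made-up vector with a tie in first place shows where they differ:

```r
# Made-up vector: the first value ties with another one for the maximum
x <- c(5, 5, 3)

strict   <- all(head(x, 1) > tail(x, -1))  # strictly greater than all others
by_max   <- x[1] == max(x)                 # equal to the maximum
by_which <- 1L == which.max(x)             # which.max returns the first maximum

c(strict = strict, by_max = by_max, by_which = by_which)
```

Only the strict comparison treats a tie as "not a new maximum", which is what the Excel formula with > does.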

Loop columns of matrix with nested apply

I am trying to loop over the columns of a matrix and change certain predefined sequences within the columns, which are available in the form of vectors.
Let's say I have the following matrix:
m2 <- matrix(sample(1:36),9,4)
[,1] [,2] [,3] [,4]
[1,] 11 6 1 14
[2,] 22 16 27 3
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] 17 9 2 4
[6,] 28 18 29 5
[7,] 20 30 13 36
[8,] 26 33 24 15
[9,] 8 12 25 7
As an example my vector of sequence starts is a and my vector of sequence ends is b. Thus the first sequence to delete in all columns is a[1] to b[1], the 2nd a[2] to b[2] and so on.
My testing code is as follows:
testing <- function(x){
apply(x,2, function(y){
a <- c(1,5)
b <- c(2,8)
mapply(function(y){
y[a:b] <- NA; y
},a,b)
})
}
Expected outcome:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] NA NA NA NA
[6,] NA NA NA NA
[7,] NA NA NA NA
[8,] NA NA NA NA
[9,] 8 12 25 7
Actual result:
Error in (function (y) : unused argument (dots[[2]][[1]])
What is wrong in the above code? I know I could just set the rows to NA, but I am trying to get the above output by using nested apply functions to learn more about them.
Use Map to build the sequence between corresponding elements of 'a' and 'b', unlist the result into a single index vector, and set those rows of 'm2' to NA.
a <- c(1, 5)
b <- c(2, 8)
m2[unlist(Map(":", a, b)),] <- NA
m2
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# [3,] 34 10 23 32
# [4,] 21 19 31 35
# [5,] NA NA NA NA
# [6,] NA NA NA NA
# [7,] NA NA NA NA
# [8,] NA NA NA NA
# [9,] 8 12 25 7
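As for the error itself: mapply hands one element of a and one element of b to the function on each call, i.e. two arguments, but the inner function(y) accepts only one, hence "unused argument". If you want to keep the nested apply/mapply structure for practice, one possible repair is sketched below. Passing a and b as arguments of testing is a choice made here, not something from the question.

```r
set.seed(1)
m2 <- matrix(sample(1:36), 9, 4)
a <- c(1, 5)  # sequence starts
b <- c(2, 8)  # sequence ends

testing <- function(x, a, b) {
  apply(x, 2, function(col) {
    # mapply calls seq(a[1], b[1]), seq(a[2], b[2]), ...; the function it
    # applies must therefore accept two arguments
    idx <- unlist(mapply(seq, a, b, SIMPLIFY = FALSE))
    col[idx] <- NA
    col
  })
}

res <- testing(m2, a, b)
res
```

Here apply walks over the columns, and mapply is only used to build the row indices, so each column is modified once and returned whole.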

melt the lower half matrix in R

How can I melt a lower half triangle plus diagonal matrix ?
11 NA NA NA NA
12 22 NA NA NA
13 23 33 NA NA
14 24 34 44 NA
15 25 35 45 55
A <- t(matrix (c(11, NA, NA, NA, NA, 12, 22, NA, NA, NA,
13, 23, 33, NA, NA, 14, 24, 34, 44, NA,15, 25,
35, 45, 55), ncol = 5))
> A
[,1] [,2] [,3] [,4] [,5]
[1,] 11 NA NA NA NA
[2,] 12 22 NA NA NA
[3,] 13 23 33 NA NA
[4,] 14 24 34 44 NA
[5,] 15 25 35 45 55
To data.frame in row and col (preserving the following order)
col row value
1 1 11
1 2 12
1 3 13
1 4 14
1 5 15
2 2 22
2 3 23
2 4 24
2 5 25
3 3 33
3 4 34
3 5 35
4 4 44
4 5 45
5 5 55
If you want the indices as columns as well, this should work:
m <- matrix(1:25,5,5)
m[upper.tri(m)] <- NA
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA NA NA NA
[2,] 2 7 NA NA NA
[3,] 3 8 13 NA NA
[4,] 4 9 14 19 NA
[5,] 5 10 15 20 25
cbind(which(!is.na(m),arr.ind = TRUE),na.omit(as.vector(m)))
row col
[1,] 1 1 1
[2,] 2 1 2
[3,] 3 1 3
[4,] 4 1 4
[5,] 5 1 5
[6,] 2 2 7
[7,] 3 2 8
[8,] 4 2 9
[9,] 5 2 10
[10,] 3 3 13
[11,] 4 3 14
[12,] 5 3 15
[13,] 4 4 19
[14,] 5 4 20
[15,] 5 5 25
I guess I'll explain this a bit. I'm using three "tricks":
1. The arr.ind argument to which, to get the indices
2. The very useful na.omit function, to avoid some extra typing
3. The fact that R stores matrices in column-major form, hence as.vector returns the values in the right order.
My one liner.
reshape2::melt(A, varnames = c('row', 'col'), na.rm = TRUE)
Here's my first solution:
test <- rbind(c(11,NA,NA,NA,NA),
c(12,22,NA,NA,NA),
c(13,23,33,NA,NA),
c(14,24,34,44,NA),
c(15,25,35,45,55)) ## Load the matrix
test2 <- as.vector(test) ## "melt" it into a vector
test <- cbind( test2[!is.na(test2)] ) ## get rid of NAs, cbind it into a column
Results are:
> test
[,1]
[1,] 11
[2,] 12
[3,] 13
[4,] 14
[5,] 15
[6,] 22
[7,] 23
[8,] 24
[9,] 25
[10,] 33
[11,] 34
[12,] 35
[13,] 44
[14,] 45
[15,] 55
Alternatively, you can use the matrix command:
test <- rbind(c(11,NA,NA,NA,NA),
c(12,22,NA,NA,NA),
c(13,23,33,NA,NA),
c(14,24,34,44,NA),
c(15,25,35,45,55)) ## Load the matrix
test2 <- matrix(test, ncol=1)
test <- cbind( test2[!is.na(test2), ] )
## same as above, except now explicitly noting rows to replace.
Here is my attempt:
# enter the data
df <- c(11,12,13,14,15,NA,22,23,24,25,NA,NA,33,34,35,NA,NA,NA,44,45,NA,NA,NA,NA,55)
dim(df) <- c(5,5)
df
# make new data frame with rows and column indicators
melteddf <- data.frame(
value=df[lower.tri(df,diag=T)],
col=rep(1:ncol(df),ncol(df):1),
row=unlist(sapply(1:nrow(df),function(x) x:nrow(df)))
)
I wish I had known about the arr.ind argument of which before now, though.
Here is a method using arrayInd, which is basically the same as @joran's but might be useful in other settings:
na.omit( data.frame(arrayInd(1:prod(dim(A)), dim(A)), value=c(A)) )
X1 X2 value
1 1 1 11
2 2 1 12
3 3 1 13
4 4 1 14
5 5 1 15
7 2 2 22
8 3 2 23
9 4 2 24
10 5 2 25
13 3 3 33
14 4 3 34
15 5 3 35
19 4 4 44
20 5 4 45
25 5 5 55
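Another base R possibility, sketched here with the matrix A from the question, uses row() and col() together with lower.tri(). The names keep and melted are chosen for illustration.

```r
# The matrix from the question
A <- rbind(c(11, NA, NA, NA, NA),
           c(12, 22, NA, NA, NA),
           c(13, 23, 33, NA, NA),
           c(14, 24, 34, 44, NA),
           c(15, 25, 35, 45, 55))

# Logical mask of the lower triangle including the diagonal
keep <- lower.tri(A, diag = TRUE)

# Subsetting with a logical matrix walks column by column, which gives
# the col/row/value order asked for in the question
melted <- data.frame(col   = col(A)[keep],
                     row   = row(A)[keep],
                     value = A[keep])
melted
```

Unlike the na.omit-based versions, this keeps any NA that happens to sit inside the lower triangle, because the selection is by position rather than by value.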
