I have a large matrix that contains various features extracted from microscopic cell images. The different features are distributed across the columns, the individual cells across the rows of that matrix. However, the measurements come from time lapse microscopy, such that each individual cell has 90 rows (time points) in that matrix. So this matrix has the dimension [cell_amount*90; feature_amount].
My goal is to:
1. calculate the difference of subsequent time points for each cell (the "derivative" of the time series), and then
2. create a new matrix that contains an aggregation of those differences for each cell (so that new matrix has the dimension [cell_amount; feature_amount]).
I set up some code in R to test my problem, where I have 4 cells, 4 features (columns) and each cell has 3 time point values. So the first cell would be on rows 1-3, the second on rows 4-6, and so on. From this I calculate the absolute difference of the values:
A <- matrix(sample(1:100, 4*12), ncol = 4)
B <- abs( A - dplyr::lag(A) )
B[seq(1,nrow(B), 3),] <- NA
This results in a matrix where the first row of each cell contains NA values:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] 82 29 54 22
[3,] 32 44 18 31
[4,] NA NA NA NA
[5,] 22 61 10 33
[6,] 19 64 54 35
[7,] NA NA NA NA
[8,] 59 18 6 10
[9,] 34 47 70 6
[10,] NA NA NA NA
[11,] 60 23 68 22
[12,] 17 13 12 9
The resulting matrix containing an aggregation for those values for each cell, in this case the variance, should then look like:
[,1] [,2] [,3] [,4]
[1,] 1250 112.5 648 40.5
[2,] 4.5 4.5 968 2
[3,] 312.5 420.5 2048 8
[4,] 924.5 50 1568 84.5
How can I calculate this new matrix in R? Any help is appreciated.
Because you used a random sample without a seed, I can't re-create your A matrix. However, here's a recreation of your B matrix.
B <- matrix(scan(text="
NA NA NA NA
82 29 54 22
32 44 18 31
NA NA NA NA
22 61 10 33
19 64 54 35
NA NA NA NA
59 18 6 10
34 47 70 6
NA NA NA NA
60 23 68 22
17 13 12 9"), ncol=4, byrow=T)
If you really want to keep this as a matrix, you can reshape it into a multi-dimensional array and then use apply over the margins to get the value of interest, for example
apply(array(B, dim=c(3,4,4)),2:3, var, na.rm=T)
# [,1] [,2] [,3] [,4]
# [1,] 1250.0 112.5 648 40.5
# [2,] 4.5 4.5 968 2.0
# [3,] 312.5 420.5 2048 8.0
# [4,] 924.5 50.0 1568 84.5
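The same reshaping idea can also start from the raw matrix A, which skips the step of inserting NA rows. The following is only a sketch: the seed, the names n_time, n_cell and n_feat, and the helper cell_diff_var are assumptions made for illustration and are not part of the question.

```r
set.seed(1)  # assumed seed, only so the example is reproducible
n_time <- 3; n_cell <- 4; n_feat <- 4
A <- matrix(sample(1:100, n_time * n_cell * n_feat), ncol = n_feat)

# Hypothetical helper: variance of the absolute successive differences,
# computed separately for every cell and every feature
cell_diff_var <- function(A, n_time) {
  n_cell <- nrow(A) / n_time
  # dimensions: time point within cell x cell x feature
  arr <- array(A, dim = c(n_time, n_cell, ncol(A)))
  apply(arr, 2:3, function(v) var(abs(diff(v))))
}

res <- cell_diff_var(A, n_time)
dim(res)  # 4 4 (cells x features)
```

Because diff() is applied inside each cell's slice, differences never cross the boundary between two cells.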
You could also create a proper grouping variable (one group per cell, i.e. per block of 3 consecutive rows) and use aggregate()
row_sample <- rep(seq_len(nrow(B)/3), each=3)
aggregate(B, list(row_sample), var, na.rm=T)
# Group.1 V1 V2 V3 V4
# 1 1 1250.0 112.5 648 40.5
# 2 2 4.5 4.5 968 2.0
# 3 3 312.5 420.5 2048 8.0
# 4 4 924.5 50.0 1568 84.5
I use a for loop (which works well) to replace two randomly chosen values in each row of a dataset with NA (the indices of these values change randomly from row to row).
Now I would like to use apply() to do exactly the same thing.
I tried this code (and many other things, which return NA everywhere):
my_fun<-function(x){if (j %in% sample(1:ncol(y),2)) {x[j]<-NA}}
apply(y,1,my_fun)
But it doesn't work (it does not make any change to the initial dataset).
The problem is that the object j is not found. j should be the number of the column.
Does someone have an idea?
From your description, I gather that you want:
my_fun <- function(x) { x[sample(1:length(x), 2)] <- NA; x }
apply(y, 1, my_fun) # result is transposed: one column per row of y
t(apply(y, 1, my_fun)) # transpose back to the original layout
Testing the function:
set.seed(42)
y <- matrix(1:60, 10)
y
t(apply(y, 1, my_fun))
# > t(apply(y, 1, my_fun))
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 11 21 31 NA NA
# [2,] 2 NA 22 32 NA 52
# [3,] 3 13 NA NA 43 53
# [4,] NA 14 24 34 NA 54
# [5,] 5 15 25 NA 45 NA
# [6,] 6 16 NA NA 46 56
# [7,] 7 NA 27 37 47 NA
# [8,] 8 18 NA 38 NA 58
# [9,] NA 19 29 39 49 NA
# [10,] 10 20 NA 40 50 NA
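The reason for the t() call is that apply() over rows collects each returned vector as a column of the result. A small sketch to check the shapes (the seed is an assumption, used only for reproducibility):

```r
my_fun <- function(x) { x[sample(length(x), 2)] <- NA; x }

set.seed(1)  # assumed seed for reproducibility
y <- matrix(1:60, 10)

raw <- apply(y, 1, my_fun)  # one column per original row
fixed <- t(raw)             # back to 10 rows x 6 columns

dim(raw)               # 6 10
dim(fixed)             # 10 6
rowSums(is.na(fixed))  # 2 for every row
```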
I am trying to loop over the columns of a matrix and change certain predefined sequences within the columns, which are available in the form of vectors.
Let's say I have the following matrix:
m2 <- matrix(sample(1:36),9,4)
[,1] [,2] [,3] [,4]
[1,] 11 6 1 14
[2,] 22 16 27 3
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] 17 9 2 4
[6,] 28 18 29 5
[7,] 20 30 13 36
[8,] 26 33 24 15
[9,] 8 12 25 7
As an example my vector of sequence starts is a and my vector of sequence ends is b. Thus the first sequence to delete in all columns is a[1] to b[1], the 2nd a[2] to b[2] and so on.
My testing code is as follows:
testing <- function(x){
apply(x,2, function(y){
a <- c(1,5)
b <- c(2,8)
mapply(function(y){
y[a:b] <- NA; y
},a,b)
})
}
Expected outcome:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] 34 10 23 32
[4,] 21 19 31 35
[5,] NA NA NA NA
[6,] NA NA NA NA
[7,] NA NA NA NA
[8,] NA NA NA NA
[9,] 8 12 25 7
Actual result:
Error in (function (y) : unused argument (dots[[2]][[1]])
What is wrong in the above code? I know I could just set the rows to NA, but I am trying to get the above output by using nested apply functions to learn more about them.
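We get the sequence of corresponding elements of 'a' and 'b' using Map, unlist it to create a vector of row indices, and set those rows of 'm2' to NA.
a <- c(1, 5) # sequence starts, as in the question
b <- c(2, 8) # sequence ends, as in the question
m2[unlist(Map(":", a, b)),] <- NA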
m2
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# [3,] 34 10 23 32
# [4,] 21 19 31 35
# [5,] NA NA NA NA
# [6,] NA NA NA NA
# [7,] NA NA NA NA
# [8,] NA NA NA NA
# [9,] 8 12 25 7
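As for the error in the question's code: mapply calls its function with one element of a and one element of b, so that function must accept two arguments, whereas function(y) accepts only one. Below is a sketch of the nested form the question asks about. The name blank_ranges is hypothetical, and the matrix is a deterministic stand-in for m2.

```r
a <- c(1, 5)             # sequence starts, from the question
b <- c(2, 8)             # sequence ends, from the question
m <- matrix(1:36, 9, 4)  # deterministic stand-in for m2

blank_ranges <- function(x, starts, ends) {
  # outer apply walks over the columns
  apply(x, 2, function(col) {
    # inner mapply pairs starts[i] with ends[i]: here 1:2 and 5:8
    idx <- unlist(mapply(seq, starts, ends, SIMPLIFY = FALSE))
    col[idx] <- NA
    col
  })
}

out <- blank_ranges(m, a, b)
which(is.na(out[, 1]))  # 1 2 5 6 7 8
```

Since the same rows are blanked in every column, the one-line row assignment is simpler; the nested version is mainly useful as an exercise.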
I have read the description of by.column for rollapply in the manual, but I couldn't understand how to use it. See below:
x=matrix(1:60,nrow=10)
library('zoo')
rollapply(x,3,mean,fill=NA,align="right",by.column=FALSE)
[1] NA NA 27 28 29 30 31 32 33 34
When I use by.column=FALSE, it applies mean to a rolling window of 3 rows across all columns at once, e.g. mean(x[1:3,]) for the first window.
Now, if I use by.column=TRUE, then I get:
x=matrix(1:60,nrow=10)
rollapply(x,3,mean,fill=NA,align="right",by.column=TRUE)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA NA NA NA NA NA
[2,] NA NA NA NA NA NA
[3,] 2 12 22 32 42 52
[4,] 3 13 23 33 43 53
[5,] 4 14 24 34 44 54
[6,] 5 15 25 35 45 55
[7,] 6 16 26 36 46 56
[8,] 7 17 27 37 47 57
[9,] 8 18 28 38 48 58
[10,] 9 19 29 39 49 59
I can't make sense of the result. Could anyone please explain what by.column is for, and maybe provide an example?
by.column = TRUE (which is the default) with FUN = mean does a rolling mean separately for each column. The ith column of the result would be:
rollapplyr(x[, i], 3, mean, fill = NA)
by.column = FALSE inputs all columns at once to the function so in this case it would be the same as:
c(NA, NA, sapply(1:8, function(ix) mean(x[seq(ix, ix+2), ])))
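To make the difference concrete without zoo, both modes can be imitated in base R. This is only an emulation for a right-aligned window of width 3, not zoo's implementation; the names per_column and all_columns are made up for the example.

```r
x <- matrix(1:60, nrow = 10)
width <- 3
starts <- seq_len(nrow(x) - width + 1)  # window start rows 1..8
pad <- rep(NA, width - 1)               # right alignment: first 2 results are NA

# like by.column = TRUE: one rolling mean per column
per_column <- apply(x, 2, function(col) {
  c(pad, sapply(starts, function(i) mean(col[i:(i + width - 1)])))
})

# like by.column = FALSE: each window is a 3 x 6 block reduced to one number
all_columns <- c(pad, sapply(starts, function(i) mean(x[i:(i + width - 1), ])))

per_column[3, ]  # 2 12 22 32 42 52
all_columns      # NA NA 27 28 29 30 31 32 33 34
```

by.column = FALSE becomes useful when the function needs several columns of the window at once, for example a regression of one column on another.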
How can I quickly replace an NA with the mean of the values in the previous and next rows?
name grade
1 A 56
2 B NA
3 C 70
4 D 96
such that B's grade would be 63.
Or you may try na.approx from package zoo: "Missing values (NAs) are replaced by linear interpolation"
library(zoo)
x <- c(56, NA, 70, 96)
na.approx(x)
# [1] 56 63 70 96
This also works if you have more than one consecutive NA:
vals <- c(1, NA, NA, 7, NA, 10)
na.approx(vals)
# [1] 1.0 3.0 5.0 7.0 8.5 10.0
na.approx is based on the base function approx, which may be used instead:
vals <- c(1, NA, NA, 7, NA, 10)
xout <- seq_along(vals)
x <- xout[!is.na(vals)]
y <- vals[!is.na(vals)]
approx(x = x, y = y, xout = xout)$y
# [1] 1.0 3.0 5.0 7.0 8.5 10.0
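One case the examples above do not cover is an NA at the very start or end of the vector, where there is no neighbour on one side. With base approx this is controlled by the rule argument; the vector below is made up for illustration.

```r
vals <- c(NA, 1, NA, 3, NA)
xout <- seq_along(vals)
known <- !is.na(vals)

# rule = 1 (the default): positions outside the known range stay NA
inner_only <- approx(x = xout[known], y = vals[known], xout = xout, rule = 1)$y
# rule = 2: positions outside the range take the nearest known value
extended <- approx(x = xout[known], y = vals[known], xout = xout, rule = 2)$y

inner_only  # NA 1 2 3 NA
extended    # 1 1 2 3 3
```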
Assume you have a data.frame df like this:
> df
name grade
1 A 56
2 B NA
3 C 70
4 D 96
5 E NA
6 F 95
Then you can use the following:
> ind <- which(is.na(df$grade))
> df$grade[ind] <- sapply(ind, function(i) with(df, mean(c(grade[i-1], grade[i+1]))))
> df
name grade
1 A 56
2 B 63
3 C 70
4 D 96
5 E 95.5
6 F 95
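The same replacement can be written without sapply by indexing the neighbours directly. This sketch works on a plain vector with the same grades and only touches interior positions, so an NA in the first or last position is left alone.

```r
grade <- c(56, NA, 70, 96, NA, 95)

ind <- which(is.na(grade))
# keep only interior positions so that ind - 1 and ind + 1 are valid indices
ind <- ind[ind > 1 & ind < length(grade)]
grade[ind] <- (grade[ind - 1] + grade[ind + 1]) / 2

grade  # 56.0 63.0 70.0 96.0 95.5 95.0
```

If two NAs sit next to each other, both stay NA with this approach, because each one has a missing neighbour.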
An alternative, which fills in the column median rather than the mean of the neighbouring values, is the na.roughfix function from the randomForest package.
As described in the documentation, it works with a data frame or numeric matrix.
Specifically, for numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
Using the same examples as @Henrik,
library(randomForest)
x <- c(56, NA, 70, 96)
na.roughfix(x)
#[1] 56 70 70 96
or with a larger matrix:
y <- matrix(1:50, nrow = 10)
y[sample(1:length(y), 4, replace = FALSE)] <- NA
y
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 11 21 31 41
# [2,] 2 12 22 32 42
# [3,] 3 NA 23 33 NA
# [4,] 4 14 24 34 44
# [5,] 5 15 25 35 45
# [6,] 6 16 NA 36 46
# [7,] 7 17 27 37 47
# [8,] 8 18 28 38 48
# [9,] 9 19 29 39 49
# [10,] 10 20 NA 40 50
na.roughfix(y)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 11 21.0 31 41
# [2,] 2 12 22.0 32 42
# [3,] 3 16 23.0 33 46
# [4,] 4 14 24.0 34 44
# [5,] 5 15 25.0 35 45
# [6,] 6 16 24.5 36 46
# [7,] 7 17 27.0 37 47
# [8,] 8 18 28.0 38 48
# [9,] 9 19 29.0 39 49
#[10,] 10 20 24.5 40 50
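For a purely numeric matrix, the column-median replacement can also be sketched in base R. The helper impute_median is a hypothetical name, not a randomForest function, and the NA positions are fixed here to match the example above rather than sampled.

```r
y <- matrix(1:50, nrow = 10)
# same NA positions as in the example above: (row, column) pairs
y[cbind(c(3, 3, 6, 10), c(2, 5, 3, 3))] <- NA

# Hypothetical helper: replace NAs in each column with that column's median
impute_median <- function(m) {
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) m[miss, j] <- median(m[, j], na.rm = TRUE)
  }
  m
}

filled <- impute_median(y)
filled[3, 2]   # 16
filled[6, 3]   # 24.5
```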
Q. I have an Erdős-Rényi graph. I infect a vertex and want to see what sequence of vertices the disease would follow. igraph has helpful functions like get.adjacency() and neighbors().
Details. This is the adjacency matrix with vertex names instead of 0/1 flags, and I'm trying to get the contagion chain out of it, i.e. the flow/sequence of an epidemic through a graph if a certain vertex is infected. Let's not worry about infection probabilities here (assume all vertices hit are infected with probability 1).
So suppose I hit vertex 1 (which is row 1 here). We see that it has outgoing links to vertices 4,5,18,22,23,24,25. So the next vertices will be those connected to 4,5,18...25, i.e. the values in row4, row5, row18,... row25. Then, according to the model, the disease will travel through these and so forth.
I understand that I can pass a string to order the matrix rows. My problem is, I cannot figure out how to generate that sequence.
The matrix looks like this.
> channel
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 4 5 18 22 23 24 25 NA
[2,] 6 10 11 18 25 NA NA NA
[3,] 7 11 18 20 NA NA NA NA
[4,] 24 NA NA NA NA NA NA NA
[5,] 1 3 9 13 14 NA NA NA
[6,] 3 8 9 14 19 23 NA NA
[7,] 3 4 8 15 20 22 NA NA
[8,] 2 3 25 NA NA NA NA NA
[9,] 3 4 11 13 20 NA NA NA
[10,] 4 5 8 15 19 20 21 22
[11,] 3 13 15 18 19 23 NA NA
[12,] 11 13 16 NA NA NA NA NA
[13,] 4 6 14 15 16 17 19 21
[14,] 2 6 13 NA NA NA NA NA
[15,] 3 17 20 NA NA NA NA NA
[16,] 6 15 18 23 NA NA NA NA
[17,] 2 25 NA NA NA NA NA NA
[18,] 2 5 NA NA NA NA NA NA
[19,] 3 11 NA NA NA NA NA NA
[20,] 1 4 7 10 12 21 22 25
[21,] 2 4 6 13 14 16 18 NA
[22,] 1 3 4 15 23 NA NA NA
[23,] 1 16 24 NA NA NA NA NA
[24,] 7 8 19 20 22 NA NA NA
[25,] 7 12 13 17 NA NA NA NA
I want to reorder this matrix based on a selection criterion as follows:
An R solution would be most helpful (but I'm interested in the algorithm, so Python, Ruby, etc. would also be great). The resulting vector will have a length of 115 (8x25 = 200 entries minus 85 NAs = 115) and would look like the following, which is basically how the disease would spread if vertex 1 becomes infected.
4,5,18,22,23,24,25,24,1,3,9,13,14,2,5,1,3,4,15,23,1,16,24,7,8,19,20,22,7,12,13,17,7,8,19,20,22, 4,5,18,22,23,24,25,7,11,18,20...
What I know so far:
1. R has a package **igraph** which lets me calculate neighbors(graph, vertex, "out")
2. The same package can also generate get.adjlist(graph...), get.adjacency
Finding a "contagion chain" like this is equivalent to a breadth-first search through the graph, e.g.:
library(igraph)
set.seed(50)
g = erdos.renyi.game(20, 0.1)
plot(g)
order = graph.bfs(g, root=14, order=TRUE, unreachable=FALSE)$order
Output:
> order
[1] 14 1 2 11 16 18 4 19 12 17 20 7 8 15 5 13 9 NaN NaN NaN
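For the neighbour matrix format used in the question (one row per vertex, padded with NA), a breadth-first search can also be sketched in base R. The small channel matrix and the function bfs_order below are made up for illustration. Note that a breadth-first search lists each vertex once, so the result is shorter than the 115-element vector described in the question.

```r
# Hypothetical 5-vertex example in the same layout as the question's matrix
channel <- rbind(c(2, 3), c(4, NA), c(4, 5), c(1, NA), c(NA, NA))

# out-neighbours of each vertex, with the NA padding removed
adj <- lapply(seq_len(nrow(channel)), function(i) {
  row <- channel[i, ]
  row[!is.na(row)]
})

bfs_order <- function(adj, root) {
  visited <- rep(FALSE, length(adj))
  visited[root] <- TRUE
  queue <- root
  order <- integer(0)
  while (length(queue) > 0) {
    v <- queue[1]          # take the oldest queued vertex
    queue <- queue[-1]
    order <- c(order, v)
    for (w in adj[[v]]) {  # enqueue neighbours not seen before
      if (!visited[w]) {
        visited[w] <- TRUE
        queue <- c(queue, w)
      }
    }
  }
  order
}

infected <- bfs_order(adj, 1)
infected  # 1 2 3 4 5
```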
It's not clear how you define the ordering of the rows, so... just a few hints:
You can select a permutation/combination of rows by passing an index vector:
> (m <- matrix(data=1:9, nrow=3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> m[c(2,3,1),]
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 1 4 7
The function t() transposes a matrix.
The matrix is stored in columns-first (or column-major) order:
> as.vector(m)
[1] 1 2 3 4 5 6 7 8 9
NA values can be removed by subsetting:
> qq <- c(1,2,NA,5,7,NA,3,NA,NA)
> qq[!is.na(qq)]
[1] 1 2 5 7 3
Also, graph algorithms are provided by Bioconductor's graph or CRAN's igraph packages.
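Put together, the hints above give a way to flatten selected rows in a chosen order while dropping the NA padding. The matrix and the row order ord are made up for this sketch; how the order is derived is still up to the selection criterion.

```r
m <- rbind(c(4, 5, NA),
           c(1, NA, NA),
           c(2, 3, 6))
ord <- c(2, 3, 1)              # hypothetical row order

picked <- m[ord, , drop = FALSE]
flat <- as.vector(t(picked))   # transpose so values come out row by row
result <- flat[!is.na(flat)]
result                         # 1 2 3 6 4 5
```

The t() call matters because as.vector reads a matrix column by column; without it the values of different rows would be interleaved.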