Combining more than 2 columns by removing NA's in R

At first sight this seems a duplicate of Combine/merge columns while avoiding NA? but in fact it isn't: I sometimes deal with more than two columns, not just two.
My dataframe looks like this:
col1 col2 col3 col4 col5
[1,] 1 NA NA 13 NA
[2,] NA NA 10 NA 18
[3,] NA 7 NA 15 NA
[4,] 4 NA NA 16 NA
Now I want to "collapse" this dataframe into a dataframe with fewer columns and with the NA's removed. In fact I am looking for an "Excel way of doing it": remove one cell and the whole row moves one cell to the left.
The result in this example case would be:
col1 col2
[1,] 1 13
[2,] 10 18
[3,] 7 15
[4,] 4 16
Does anyone have an idea how to do this in R? Many thanks in advance!

You can use apply for this. If df is your dataframe:
df2 <- apply(df,1,function(x) x[!is.na(x)])
df3 <- data.frame(t(df2))
colnames(df3) <- colnames(df)[1:ncol(df3)]
Output:
# col1 col2
# 1 13
# 10 18
# 7 15
# 4 16
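One caveat, not from the original answer: apply() only simplifies its result to a matrix when every row yields the same number of non-NA values. With unequal counts it returns a list, and the t() step fails. For example:
df_uneq <- data.frame(col1 = c(1, NA), col2 = c(NA, NA), col3 = c(3, 4))
apply(df_uneq, 1, function(x) x[!is.na(x)])
# returns a list of vectors of lengths 2 and 1, not a matrix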

You can use apply and na.exclude:
DF
## V1 V2 V3 V4 V5
## 1 1 NA NA 13 NA
## 2 NA NA 10 NA 18
## 3 NA 7 NA 15 NA
## 4 4 NA NA 16 NA
t(apply(DF, 1, na.exclude))
## [,1] [,2]
## [1,] 1 13
## [2,] 10 18
## [3,] 7 15
## [4,] 4 16
If you want to keep the dimensions of the data.frame the same, you can use sort with na.last = TRUE instead. This will also take care of cases where rows have unequal numbers of values.
t(apply(DF, 1, sort, na.last = T))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 13 NA NA NA
## [2,] 10 18 NA NA NA
## [3,] 7 15 NA NA NA
## [4,] 4 16 NA NA NA
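To illustrate the unequal-rows case, here is a small example of my own: append a row with three non-NA values; sort with na.last = TRUE still left-aligns every row:
DF2 <- rbind(DF, c(5, NA, 9, 11, NA))
t(apply(DF2, 1, sort, na.last = TRUE))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 13 NA NA NA
## [2,] 10 18 NA NA NA
## [3,] 7 15 NA NA NA
## [4,] 4 16 NA NA NA
## [5,] 5 9 11 NA NA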

This function is a bit long-winded but (1) it will be faster in the long run and (2) it offers a good amount of flexibility:
myFun <- function(inmat, outList = TRUE, fill = NA, origDim = FALSE) {
  ## Split up the data by row and isolate the non-NA values
  myList <- lapply(sequence(nrow(inmat)), function(x) {
    y <- inmat[x, ]
    y[!is.na(y)]
  })
  ## If a `list` is all that you want, the function stops here
  if (isTRUE(outList)) {
    myList
  } else {
    ## If you want a matrix instead, it goes on like this
    Len <- vapply(myList, length, 1L)
    ## The new matrix can be either just the number of columns required
    ## or it can have the same number of columns as the input matrix
    if (isTRUE(origDim)) Ncol <- ncol(inmat) else Ncol <- max(Len)
    Nrow <- nrow(inmat)
    M <- matrix(fill, ncol = Ncol, nrow = Nrow)
    M[cbind(rep(sequence(Nrow), Len), sequence(Len))] <-
      unlist(myList, use.names = FALSE)
    M
  }
}
To test it out, let's create a function to make up some dummy data:
makeData <- function(nrow = 10, ncol = 5, pctNA = .8, maxval = 25) {
  a <- nrow * ncol
  m <- matrix(sample(maxval, a, TRUE), ncol = ncol)
  m[sample(a, a * pctNA)] <- NA
  m
}
set.seed(1)
m <- makeData(nrow = 5, ncol = 4, pctNA=.6)
m
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] 10 24 NA 18
# [3,] NA 17 NA 25
# [4,] NA 16 10 NA
# [5,] NA 2 NA NA
... and apply it...
myFun(m)
# [[1]]
# integer(0)
#
# [[2]]
# [1] 10 24 18
#
# [[3]]
# [1] 17 25
#
# [[4]]
# [1] 16 10
#
# [[5]]
# [1] 2
myFun(m, outList = FALSE)
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 10 24 18
# [3,] 17 25 NA
# [4,] 16 10 NA
# [5,] 2 NA NA
## Try also
## myFun(m, outList = FALSE, origDim = TRUE)
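For reference, with the m shown above that call should keep all four columns of the input (output reconstructed from the function's logic, not re-run):
myFun(m, outList = FALSE, origDim = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] 10 24 18 NA
# [3,] 17 25 NA NA
# [4,] 16 10 NA NA
# [5,] 2 NA NA NA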
And, let's run some timings on bigger data in comparison to the other answers so far:
set.seed(1)
m <- makeData(nrow = 1e5, ncol = 5, pctNA = .75)
## Will return a matrix
funCP <- function(inmat) t(apply(inmat, 1, sort, na.last = T))
system.time(funCP(m))
# user system elapsed
# 9.776 0.000 9.757
## Will return a list in this case
funJT <- function(inmat) apply(inmat, 1, function(x) x[!is.na(x)])
system.time(JT <- funJT(m))
# user system elapsed
# 0.577 0.000 0.575
## Output a list
system.time(AM <- myFun(m))
# user system elapsed
# 0.469 0.000 0.466
identical(JT, AM)
# [1] TRUE
## Output a matrix
system.time(myFun(m, outList=FALSE, origDim=TRUE))
# user system elapsed
# 0.610 0.000 0.612
So, the list output appears slightly faster than @JT85's solution, and the matrix output appears slightly slower. But, compared to using sort row-by-row, this is a definite improvement.

Related

Lapply over several parameters, faster method

Suppose I have two vectors
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
and a function
calc <- function(x, y) {x + y}
I want to apply this function to the 1st value in a for each value in b, then to the 2nd value in a, and so on. Suppose in my case calc only allows a single value from a and a single value from b as input, so lapply(a,calc,b) wouldn't work because length(b) is not 1 (it gives me an error).
Also mapply doesn't give me the wanted solution either; it only applies the function to paired values, i.e. 1+6, 2+7, etc.
So I built a function that gave me the wanted solution
myfunc <- function(z){lapply(a,calc,z)}
and applied it on b
solution <- lapply(b,myfunc)
We see here that the difference to lapply(a,calc,b) or a nested lapply(a,lapply,calc,b) is that it gives me all the values, each in its own list. That's what I wanted, or at least it is a function that gave me the right result with no error.
Now, is there a faster or more straightforward method? I just experimented here a little, and with my actual function, which is much larger than calc, it takes 10 minutes; maybe I have to slim down my original function and there is no faster method here...
EDIT:
In my function there is something like this,
calc <- function(x,y){
# ...
number <- x
example <- head(number,n=y)
# ...
}
where a vector as an input for y doesn't work anymore. With lapply(a,lapply,calc,b) or lapply(a,calc,b) I get an error,
Error in head.default(number, n = y) : length(n) == 1L is not TRUE
As Florian says, outer() could be an option.
outer(a, b, calc)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 7 8 9 10 11
# [2,] 8 9 10 11 12
# [3,] 9 10 11 12 13
# [4,] 10 11 12 13 14
# [5,] 11 12 13 14 15
But as MichaelChirico mentions, with a function that isn't vectorized it won't work. In that case something else has to be hacked together. These might or might not be quicker than your current solution.
All combinations (so both calc(1, 6) and calc(6, 1) are performed, similar to outer())
Number of calculations: n^2
eg <- expand.grid(a, b)
m1 <- mapply(calc, eg[, 1], eg[, 2])
matrix(m1, 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 7 8 9 10 11
# [2,] 8 9 10 11 12
# [3,] 9 10 11 12 13
# [4,] 10 11 12 13 14
# [5,] 11 12 13 14 15
Only unique combinations (so assumes your function is symmetric)
Number of calculations: (n^2 - n) / 2
cn <- t(combn(1:length(a), 2))
m2 <- mapply(calc, a[cn[, 1]], b[cn[, 2]])
mat <- matrix(, length(a), length(a))
mat[upper.tri(mat)] <- m2
mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 8 9 10 11
# [2,] NA NA 10 11 12
# [3,] NA NA NA 12 13
# [4,] NA NA NA NA 14
# [5,] NA NA NA NA NA
This second one ignores the diagonal, but adding those values is easy, as that's what the OP's mapply() call returned.
diag(mat) <- mapply(calc, a, b)
mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 7 8 9 10 11
# [2,] NA 9 10 11 12
# [3,] NA NA 11 12 13
# [4,] NA NA NA 13 14
# [5,] NA NA NA NA 15
This solved it for me: adding SIMPLIFY = FALSE to the mapply() call, thanks to @AkselA.
eg <- expand.grid(a, b)
m1 <- mapply(calc, eg[, 1], eg[, 2], SIMPLIFY = FALSE)
However, this method is only slightly faster than my own solution in my OP.
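If calc() itself cannot be vectorized, one more option, not covered in the answers above, is to wrap it in base R's Vectorize() so that outer() accepts it:
outer(a, b, Vectorize(calc))
# same 5 x 5 matrix as outer(a, b, calc) above, but works even when
# calc() only accepts scalar x and y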

Convert a matrix to a list of matrices based on a vector of row lengths

I want to split a large matrix, mt, into a list of sub-matrices, res. The number of rows for each sub-matrix is specified by a vector, lens.
For example,
> mt=matrix(c(1:20),ncol=2)
> mt
[,1] [,2]
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
lens=c(2,3,5)
What I want is a function some_function, that can offer the following result,
> res=some_function(mt,lens)
> res
[[1]]
[,1] [,2]
[1,] 1 11
[2,] 2 12
[[2]]
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
[[3]]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
Speed is a big concern. The faster, the better!
Many thanks!
A function that creates an index based on the length of each group and splits the matrix:
mt <- matrix(c(1:20), ncol=2)
# Two arguments: m - matrix, len - length of each group
m_split <- function(m, len){
  index <- 1:sum(len)
  group <- rep(1:length(len), times = len)
  index_list <- split(index, group)
  mt_list <- lapply(index_list, function(vec) m[vec, ])
  return(mt_list)
}
m_split(mt, c(2, 3, 5))
$`1`
[,1] [,2]
[1,] 1 11
[2,] 2 12
$`2`
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
$`3`
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
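One detail to guard against (my note, not in the original answer): if any group in len has length 1, m[vec, ] drops that sub-matrix to a plain vector. Adding drop = FALSE inside m_split avoids this:
mt_list <- lapply(index_list, function(vec) m[vec, , drop = FALSE])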
Update
I used the following code to compare the performance of each method in this post.
library(microbenchmark)
library(data.table)
# Test case from @missuse
mt <- matrix(c(1:20000000),ncol=10)
lens <- c(20000,15000,(nrow(mt)-20000-15000))
# Functions from @Damiano Fantini
split.df <- function(mt, lens) {
  fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
  split(as.data.frame(mt), f = fac)
}
split.mat <- function(mt, lens) {
  fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
  lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
# Benchmarking
microbenchmark(m1 = {m_split(mt, lens)},  # @ycw's method
               m2 = {pam = rep(1:length(lens), times = lens)
                     split(data.table(mt), pam)},  # @missuse's data.table method
               m3 = {split.df(mt, lens)},  # @Damiano Fantini's data frame method
               m4 = {split.mat(mt, lens)}) # @Damiano Fantini's matrix method
Unit: milliseconds
expr min lq mean median uq max neval
m1 167.6896 209.7746 251.0932 230.5920 274.9347 555.8839 100
m2 402.3415 497.2397 554.1094 547.9603 599.7632 787.4112 100
m3 552.8548 657.6245 719.2548 711.4123 769.6098 989.6779 100
m4 166.6581 203.6799 249.2965 235.5856 275.4790 547.4927 100
As we can see, m1 and m4 are the fastest, and there is almost no difference between them. In other words, there is no need to convert the matrix to a data frame or a data.table, especially if the OP will keep working on the matrix; working directly on the matrix (m1 and m4) is sufficient.
If you are OK working with data.frames instead of matrices, you might build a grouping factor/vector according to lens and then use split(). Alternatively, use this grouping vector to subset your matrix and return a list. In this example, I wrapped the two solutions into two functions:
# your data
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
# based on split
split.df <- function(mt, lens) {
  fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
  split(as.data.frame(mt), f = fac)
}
split.df(mt, lens)
# based on subsetting
split.mat <- function(mt, lens) {
  fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
  lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
split.mat(mt, lens)
This second option is about 10 times faster than the first one, according to microbenchmark:
library(microbenchmark)
microbenchmark({split.df(mt, lens)}, times = 1000)
# median = 323.743 microseconds
microbenchmark({split.mat(mt, lens)}, times = 1000)
# median = 31.7645 microseconds
One approach is using split; however, split operates on vectors and data.frames, so you need to convert the matrix first - data.table should be efficient:
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
pam = rep(1:length(lens), times = lens)
library(data.table)
mt_split <- split(data.table(mt), pam)
mt_split
#output
$`1`
V1 V2
1: 1 11
2: 2 12
$`2`
V1 V2
1: 3 13
2: 4 14
3: 5 15
$`3`
V1 V2
1: 6 16
2: 7 17
3: 8 18
4: 9 19
5: 10 20
Checking speed
mt=matrix(c(1:20000000),ncol=10)
lens=c(20000,15000,(nrow(mt)-20000-15000))
pam = rep(1:length(lens), times = lens)
system.time(split(data.table(mt), pam))
#output
user system elapsed
0.75 0.20 0.96

Subsetting non-NA

I have a matrix in which every row has at least one NA cell, and every column has at least one NA cell as well. What I need is to find the largest subset of this matrix that contains no NAs.
For example, for this matrix A
A <-
structure(c(NA, NA, NA, NA, 2L, NA,
1L, 1L, 1L, 0L, NA, NA,
1L, 8L, NA, 1L, 1L, NA,
NA, 1L, 1L, 6L, 1L, 3L,
NA, 1L, 5L, 1L, 1L, NA),
.Dim = c(6L, 5L),
.Dimnames =
list(paste0("R", 1:6),
paste0("C", 1:5)))
A
C1 C2 C3 C4 C5
R1 NA 1 1 NA NA
R2 NA 1 8 1 1
R3 NA 1 NA 1 5
R4 NA 0 1 6 1
R5 2 NA 1 1 1
R6 NA NA NA 3 NA
There are two solutions (8 cells): A[c(2, 4), 2:5] and A[2:5, 4:5], though finding just one valid solution is enough for my purposes. The dimensions of my actual matrix are 77x132.
Being a noob, I see no obvious way to do this. Could anyone help me with some ideas?
1) optim In this approach we relax the problem to a continuous optimization problem which we solve with optim.
The objective function is f and the input to it is a 0-1 vector whose first nrow(A) entries correspond to rows and whose remaining entries correspond to columns. f uses a matrix Ainf which is derived from A by replacing the NAs with a large negative number and the non-NAs with 1. In terms of Ainf, the negative of the number of elements in the rectangle of rows and columns corresponding to x is -x[seq(6)] %*% Ainf %*% x[-seq(6)] (here 6 = nrow(A)), which we minimize as a function of x subject to each component of x lying between 0 and 1.
Although this is a relaxation of the original problem to continuous optimization, it seems that we get an integer solution anyway, as desired.
Actually most of the code below is just to get the starting value. To do that we first apply seriation. This permutes the rows and columns giving a more blocky structure and then in the permuted matrix we find the largest square submatrix.
In the case of the specific A in the question, the largest rectangular submatrix happens to be square, and the starting values are already good enough to produce the optimum; but we will perform the optimization anyway so that it works in general. You can play around with different starting values if you like. For example, change k from 1 to some higher number in largestSquare, in which case largestSquare will return k columns giving k starting values, which can be used in k runs of optim, taking the best.
If the starting values are sufficiently good then this should produce the optimum.
library(seriation) # only used for starting values
A.na <- is.na(A) + 0
Ainf <- ifelse(A.na, -prod(dim(A)), 1) # used by f
nr <- nrow(A) # used by f
f <- function(x) - c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])
# starting values
# Input is a square matrix of zeros and ones.
# Output is a matrix with k columns such that first column defines the
# largest square submatrix of ones, second defines next largest and so on.
# Based on algorithm given here:
# http://www.geeksforgeeks.org/maximum-size-sub-matrix-with-all-1s-in-a-binary-matrix/
largestSquare <- function(M, k = 1) {
  nr <- nrow(M); nc <- ncol(M)
  S <- 0*M; S[1, ] <- M[1, ]; S[, 1] <- M[, 1]
  for(i in 2:nr)
    for(j in 2:nc)
      if (M[i, j] == 1) S[i, j] = min(S[i, j-1], S[i-1, j], S[i-1, j-1]) + 1
  o <- head(order(-S), k)
  d <- data.frame(row = row(M)[o], col = col(M)[o], mx = S[o])
  apply(d, 1, function(x) {
    dn <- dimnames(M[x[1] - 1:x[3] + 1, x[2] - 1:x[3] + 1])
    out <- c(rownames(M) %in% dn[[1]], colnames(M) %in% dn[[2]]) + 0
    setNames(out, unlist(dimnames(M)))
  })
}
s <- seriate(A.na)
p <- permute(A.na, s)
# calculate the largest square submatrix of zeros in p, rearranged to be in A's order
st <- largestSquare(1-p)[unlist(dimnames(A)), 1]
res <- optim(st, f, lower = 0*st, upper = st^0, method = "L-BFGS-B")
giving:
> res
$par
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
$value
[1] -9
$counts
function gradient
1 1
$convergence
[1] 0
$message
[1] "CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL"
2) GenSA Another possibility is to repeat (1), but instead of using optim use GenSA from the GenSA package. It does not require starting values (although you can provide one via the par argument, which might improve the solution in some cases), so the code is considerably shorter; but since it uses simulated annealing it can be expected to take substantially longer to run. We use f (and nr and Ainf, which f uses) from (1). Below we try it without a starting value.
library(GenSA)
resSA <- GenSA(lower = rep(0, sum(dim(A))), upper = rep(1, sum(dim(A))), fn = f)
giving:
> setNames(resSA$par, unlist(dimnames(A)))
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
> resSA$value
[1] -9
I have a solution, but it doesn't scale very well:
findBiggestSubmatrixNonContiguous <- function(A) {
    A <- !is.na(A); ## don't care about non-NAs
    howmany <- expand.grid(nr=seq_len(nrow(A)),nc=seq_len(ncol(A)));
    howmany <- howmany[order(apply(howmany,1L,prod),decreasing=T),];
    for (ri in seq_len(nrow(howmany))) {
        nr <- howmany$nr[ri];
        nc <- howmany$nc[ri];
        rcom <- combn(nrow(A),nr);
        ccom <- combn(ncol(A),nc);
        comcom <- expand.grid(ri=seq_len(ncol(rcom)),ci=seq_len(ncol(ccom)));
        for (comi in seq_len(nrow(comcom)))
            if (all(A[rcom[,comcom$ri[comi]],ccom[,comcom$ci[comi]]]))
                return(list(ri=rcom[,comcom$ri[comi]],ci=ccom[,comcom$ci[comi]]));
    }; ## end for
    NULL;
}; ## end findBiggestSubmatrixNonContiguous()
It's based on the idea that if the matrix has a small enough density of NAs, then by searching for the largest submatrices first, you'll be likely to find a solution fairly quickly.
The algorithm works by computing a cartesian product of all counts of rows and counts of columns that could be indexed out of the original matrix to produce the submatrix. The set of pairs of counts is then decreasingly ordered by the size of the submatrix that would be produced by each pair of counts; in other words, ordered by the product of the two counts. It then iterates over these pairs. For each pair, it computes all combinations of row indexes and column indexes that could be taken for that pair of counts, and tries each combination in turn until it finds a submatrix that contains zero NAs. Upon finding such a submatrix, it returns that set of row and column indexes as a list.
The result is guaranteed to be correct because it tries submatrix sizes in decreasing order, so the first one it finds must be the biggest (or tied for the biggest) possible submatrix that satisfies the condition.
## OP's example matrix
A <- data.frame(C1=c(NA,NA,NA,NA,2L,NA),C2=c(1L,1L,1L,0L,NA,NA),C3=c(1L,8L,NA,1L,1L,NA),C4=c(NA,1L,1L,6L,1L,3L),C5=c(NA,1L,5L,1L,1L,NA),row.names=c('R1','R2','R3','R4','R5','R6'));
A;
## C1 C2 C3 C4 C5
## R1 NA 1 1 NA NA
## R2 NA 1 8 1 1
## R3 NA 1 NA 1 5
## R4 NA 0 1 6 1
## R5 2 NA 1 1 1
## R6 NA NA NA 3 NA
system.time({ res <- findBiggestSubmatrixNonContiguous(A); });
## user system elapsed
## 0.094 0.000 0.100
res;
## $ri
## [1] 2 3 4
##
## $ci
## [1] 2 4 5
##
A[res$ri,res$ci];
## C2 C4 C5
## R2 1 1 1
## R3 1 1 5
## R4 0 6 1
We see that the function works very quickly on the OP's example matrix, and returns a correct result.
randTest <- function(NR,NC,probNA,seed=1L) {
    set.seed(seed);
    A <- replicate(NC,sample(c(NA,0:9),NR,prob=c(probNA,rep((1-probNA)/10,10L)),replace=T));
    print(A);
    print(system.time({ res <- findBiggestSubmatrixNonContiguous(A); }));
    print(res);
    print(A[res$ri,res$ci,drop=F]);
    invisible(res);
}; ## end randTest()
I wrote the above function to make testing easier. We can call it to test a random input matrix of size NR by NC, with probability probNA that any given cell is NA.
Here are a few trivial tests:
randTest(8L,1L,1/3);
## [,1]
## [1,] NA
## [2,] 1
## [3,] 4
## [4,] 9
## [5,] NA
## [6,] 9
## [7,] 0
## [8,] 5
## user system elapsed
## 0.016 0.000 0.003
## $ri
## [1] 2 3 4 6 7 8
##
## $ci
## [1] 1
##
## [,1]
## [1,] 1
## [2,] 4
## [3,] 9
## [4,] 9
## [5,] 0
## [6,] 5
randTest(11L,3L,4/5);
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] 2 NA NA
## [5,] NA NA NA
## [6,] 5 NA NA
## [7,] 8 0 4
## [8,] NA NA NA
## [9,] NA NA NA
## [10,] NA 7 NA
## [11,] NA NA NA
## user system elapsed
## 0.297 0.000 0.300
## $ri
## [1] 4 6 7
##
## $ci
## [1] 1
##
## [,1]
## [1,] 2
## [2,] 5
## [3,] 8
randTest(10L,10L,1/3);
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 0 3 8 3 9 1 6 NA
## [2,] 1 NA NA 4 5 8 NA 8 2 NA
## [3,] 4 2 5 3 7 6 6 1 1 5
## [4,] 9 1 NA NA 4 NA NA 1 NA 9
## [5,] NA 7 NA 8 3 NA 5 3 7 7
## [6,] 9 3 1 2 7 NA NA 9 NA 7
## [7,] 0 2 NA 7 NA NA 3 8 2 6
## [8,] 5 0 1 NA 3 3 7 1 NA 6
## [9,] 5 1 9 2 2 5 NA 7 NA 8
## [10,] NA 7 1 6 2 6 9 0 NA 5
## user system elapsed
## 8.985 0.000 8.979
## $ri
## [1] 3 4 5 6 8 9 10
##
## $ci
## [1] 2 5 8 10
##
## [,1] [,2] [,3] [,4]
## [1,] 2 7 1 5
## [2,] 1 4 1 9
## [3,] 7 3 3 7
## [4,] 3 7 9 7
## [5,] 0 3 1 6
## [6,] 1 2 7 8
## [7,] 7 2 0 5
I don't know an easy way of verifying if the above result is correct, but it looks good to me. But it took almost 9 seconds to generate this result. Running the function on moderately larger matrices, especially a 77x132 matrix, is probably a lost cause.
Waiting to see if someone can come up with a brilliant efficient solution...

How to test whether a matrix is empty

I created an empty matrix with matrix(). When I need to test whether a given matrix is empty, how can I do that? I know that is.na(matrix()) is TRUE, but for a matrix of higher dimensions is.na() alone cannot decide it.
What I mean by empty is a matrix whose elements are all NA, or NULL.
I'm guessing that you are just looking for all. Here's a small example:
M1 <- matrix(NA, ncol = 3, nrow = 3)
M1
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] NA NA NA
# [3,] NA NA NA
M2 <- matrix(c(1, rep(NA, 8)), ncol = 3, nrow = 3)
M2
# [,1] [,2] [,3]
# [1,] 1 NA NA
# [2,] NA NA NA
# [3,] NA NA NA
all(is.na(M1))
# [1] TRUE
all(is.na(M2))
# [1] FALSE
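Since the question also mentions NULL, a tiny helper (the name is_empty is my own) can cover both cases:
is_empty <- function(m) is.null(m) || all(is.na(m))
is_empty(matrix())
# [1] TRUE
is_empty(M2)
# [1] FALSE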

Need to vectorize a function that uses a loop (replace NA rows with values from a vector)

How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain all NAs with the non-NA value from the previous row's k-th column
na.replace <- function(x, k) {
  for (i in 2:nrow(x)) {
    if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This is input data and returned data for function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that whenever a row has an NA in column 2 the whole row is NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED Some revisions to improve speed as in comments below. Also added alternatives in discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because in the previous pass through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply; a minimal sketch follows below.
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
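A minimal base-R sketch of that fill-down idea (my own illustration; it assumes, as in the example, that a row is either entirely NA or entirely non-NA):
# fill every NA with the last preceding non-NA value
fill_down <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))  # index of last non-NA seen so far
  idx[idx == 0] <- NA                      # leading NAs stay NA
  v[idx]
}
na.replace.fd <- function(x, k) {
  all.na <- rowSums(is.na(x)) == ncol(x)   # rows that are entirely NA
  x[all.na, ] <- fill_down(x[, k])[all.na]
  x
}
na.replace.fd(m, 2)  # matches na.replace(m, 2) from the question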
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base-only vectorized approach:
na.replace <- function(x, k) {
  is.all.na <- rowSums(is.na(x)) == ncol(x)
  ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
  ref.idx[ref.idx == 0] <- NA
  x[is.all.na, ] <- x[ref.idx[is.all.na], k]
  x
}
And for fair comparison with @Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally I worked out my version of a vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
  v <- x[, k]
  i <- which(is.na(v))
  r <- na.lomf(v)
  x[i, ] <- r[i]
  x
}
Here's a workaround with the na.locf function from zoo. Note that it carries the whole previous non-NA row forward rather than the value from the k-th column, so rows 5-7 below differ from the output requested in the question:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
