I have a matrix in which every row has at least one NA cell, and every column has at least one NA cell as well. What I need is to find the largest submatrix (a subset of rows and a subset of columns, not necessarily contiguous) that contains no NAs.
For example, for this matrix A
A <-
structure(c(NA, NA, NA, NA, 2L, NA,
1L, 1L, 1L, 0L, NA, NA,
1L, 8L, NA, 1L, 1L, NA,
NA, 1L, 1L, 6L, 1L, 3L,
NA, 1L, 5L, 1L, 1L, NA),
.Dim = c(6L, 5L),
.Dimnames =
list(paste0("R", 1:6),
paste0("C", 1:5)))
A
C1 C2 C3 C4 C5
R1 NA 1 1 NA NA
R2 NA 1 8 1 1
R3 NA 1 NA 1 5
R4 NA 0 1 6 1
R5 2 NA 1 1 1
R6 NA NA NA 3 NA
I have found two solutions (8 cells each): A[c(2, 4), 2:5] and A[2:5, 4:5], though finding just one valid solution is enough for my purposes. The dimensions of my actual matrix are 77x132.
Being a noob, I see no obvious way to do this. Could anyone help me with some ideas?
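For reference, here is a quick sanity check (added for this writeup, not part of the original question) that both stated solutions are indeed NA-free:
anyNA(A[c(2, 4), 2:5])  # FALSE -- 2 x 4 = 8 cells
anyNA(A[2:5, 4:5])      # FALSE -- 4 x 2 = 8 cells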
1) optim In this approach we relax the problem to a continuous optimization problem, which we solve with optim.
The objective function is f, and its input is a 0-1 vector whose first nrow(A) entries correspond to rows and whose remaining entries correspond to columns. f uses a matrix Ainf, derived from A by replacing the NAs with a large negative number and the non-NAs with 1. In terms of Ainf, the negative of the number of elements in the rectangle of rows and columns selected by x is -x[seq(6)] %*% Ainf %*% x[-seq(6)], which we minimize as a function of x subject to each component of x lying between 0 and 1.
Although this is a relaxation of the original problem to continuous optimization, it turns out that we still get an integer solution, as desired.
Actually, most of the code below just computes the starting value. To do that we first apply seriation, which permutes the rows and columns to give a blockier structure, and then we find the largest square submatrix within the permuted matrix.
In the case of the specific A in the question, the largest rectangular submatrix happens to be square, and the starting values are already good enough to produce the optimum, but we perform the optimization anyway so that the approach works in general. You can play around with different starting values if you like. For example, change k from 1 to some higher number in largestSquare, in which case largestSquare returns k columns giving k starting values, which can be used in k runs of optim, taking the best.
If the starting values are sufficiently good then this should produce the optimum.
library(seriation) # only used for starting values
A.na <- is.na(A) + 0
Ainf <- ifelse(A.na, -prod(dim(A)), 1) # used by f
nr <- nrow(A) # used by f
f <- function(x) - c(x[seq(nr)] %*% Ainf %*% x[-seq(nr)])
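As a sanity check (my own addition, not part of the answer), evaluating f at the indicator vector for the question's 8-cell solution A[c(2, 4), 2:5] returns minus the cell count:
x0 <- c(0, 1, 0, 1, 0, 0,  0, 1, 1, 1, 1) # rows R2, R4; columns C2 to C5
f(x0)
## [1] -8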
# starting values
# Input is a square matrix of zeros and ones.
# Output is a matrix with k columns such that first column defines the
# largest square submatrix of ones, second defines next largest and so on.
# Based on algorithm given here:
# http://www.geeksforgeeks.org/maximum-size-sub-matrix-with-all-1s-in-a-binary-matrix/
largestSquare <- function(M, k = 1) {
nr <- nrow(M); nc <- ncol(M)
S <- 0*M; S[1, ] <- M[1, ]; S[, 1] <- M[, 1]
for(i in 2:nr)
for(j in 2:nc)
if (M[i, j] == 1) S[i, j] = min(S[i, j-1], S[i-1, j], S[i-1, j-1]) + 1
o <- head(order(-S), k)
d <- data.frame(row = row(M)[o], col = col(M)[o], mx = S[o])
apply(d, 1, function(x) {
dn <- dimnames(M[x[1] - 1:x[3] + 1, x[2] - 1:x[3] + 1])
out <- c(rownames(M) %in% dn[[1]], colnames(M) %in% dn[[2]]) + 0
setNames(out, unlist(dimnames(M)))
})
}
s <- seriate(A.na)
p <- permute(A.na, s)
# calculate the largest square submatrix of zeros in p, rearranged into A's order
st <- largestSquare(1-p)[unlist(dimnames(A)), 1]
res <- optim(st, f, lower = 0*st, upper = st^0, method = "L-BFGS-B")
giving:
> res
$par
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
$value
[1] -9
$counts
function gradient
1 1
$convergence
[1] 0
$message
[1] "CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL"
2) GenSA Another possibility is to repeat (1) but use GenSA from the GenSA package instead of optim. It does not require starting values (although you can provide one via the par argument, which might improve the solution in some cases), so the code is considerably shorter; however, since it uses simulated annealing it can be expected to take substantially longer to run. We use f (and the nr and Ainf that f relies on) from (1). Below we try it without a starting value.
library(GenSA)
resSA <- GenSA(lower = rep(0, sum(dim(A))), upper = rep(1, sum(dim(A))), fn = f)
giving:
> setNames(resSA$par, unlist(dimnames(A)))
R1 R2 R3 R4 R5 R6 C1 C2 C3 C4 C5
0 1 1 1 0 0 0 1 0 1 1
> resSA$value
[1] -9
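The same extraction as in (1) recovers the submatrix; since GenSA optimizes over continuous values, rounding the parameters to 0/1 first is a sensible precaution:
sel <- round(resSA$par)
A[sel[seq(nr)] == 1, sel[-seq(nr)] == 1]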
I have a solution, but it doesn't scale very well:
findBiggestSubmatrixNonContiguous <- function(A) {
A <- !is.na(A); ## don't care about non-NAs
howmany <- expand.grid(nr=seq_len(nrow(A)),nc=seq_len(ncol(A)));
howmany <- howmany[order(apply(howmany,1L,prod),decreasing=T),];
for (ri in seq_len(nrow(howmany))) {
nr <- howmany$nr[ri];
nc <- howmany$nc[ri];
rcom <- combn(nrow(A),nr);
ccom <- combn(ncol(A),nc);
comcom <- expand.grid(ri=seq_len(ncol(rcom)),ci=seq_len(ncol(ccom)));
for (comi in seq_len(nrow(comcom)))
if (all(A[rcom[,comcom$ri[comi]],ccom[,comcom$ci[comi]]]))
return(list(ri=rcom[,comcom$ri[comi]],ci=ccom[,comcom$ci[comi]]));
}; ## end for
NULL;
}; ## end findBiggestSubmatrixNonContiguous()
It's based on the idea that if the matrix has a small enough density of NAs, then by searching for the largest submatrices first, you'll be likely to find a solution fairly quickly.
The algorithm works by computing the Cartesian product of all counts of rows and counts of columns that could be indexed out of the original matrix to produce a submatrix. The set of count pairs is then ordered decreasingly by the size of the submatrix each pair would produce, in other words by the product of the two counts. It then iterates over these pairs. For each pair, it computes all combinations of row indexes and column indexes that could be taken for that pair of counts, and tries each combination in turn until it finds a submatrix that contains zero NAs. Upon finding such a submatrix, it returns that set of row and column indexes as a list.
The result is guaranteed to be correct because it tries submatrix sizes in decreasing order, so the first one it finds must be the biggest (or tied for the biggest) possible submatrix that satisfies the condition.
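To make the search order concrete, here is the ordered grid of (row count, column count) pairs for a 3x3 matrix, built with the same expand.grid/order logic the function uses (ties in the product keep their original relative order):
howmany <- expand.grid(nr=seq_len(3), nc=seq_len(3));
howmany[order(apply(howmany, 1L, prod), decreasing=T), ];
##   nr nc
## 9  3  3
## 6  3  2
## 8  2  3
## 5  2  2
## 3  3  1
## 7  1  3
## 2  2  1
## 4  1  2
## 1  1  1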
## OP's example matrix
A <- data.frame(C1=c(NA,NA,NA,NA,2L,NA),C2=c(1L,1L,1L,0L,NA,NA),C3=c(1L,8L,NA,1L,1L,NA),C4=c(NA,1L,1L,6L,1L,3L),C5=c(NA,1L,5L,1L,1L,NA),row.names=c('R1','R2','R3','R4','R5','R6'));
A;
## C1 C2 C3 C4 C5
## R1 NA 1 1 NA NA
## R2 NA 1 8 1 1
## R3 NA 1 NA 1 5
## R4 NA 0 1 6 1
## R5 2 NA 1 1 1
## R6 NA NA NA 3 NA
system.time({ res <- findBiggestSubmatrixNonContiguous(A); });
## user system elapsed
## 0.094 0.000 0.100
res;
## $ri
## [1] 2 3 4
##
## $ci
## [1] 2 4 5
##
A[res$ri,res$ci];
## C2 C4 C5
## R2 1 1 1
## R3 1 1 5
## R4 0 6 1
We see that the function works very quickly on the OP's example matrix, and returns a correct result.
randTest <- function(NR,NC,probNA,seed=1L) {
set.seed(seed);
A <- replicate(NC,sample(c(NA,0:9),NR,prob=c(probNA,rep((1-probNA)/10,10L)),replace=T));
print(A);
print(system.time({ res <- findBiggestSubmatrixNonContiguous(A); }));
print(res);
print(A[res$ri,res$ci,drop=F]);
invisible(res);
}; ## end randTest()
I wrote the above function to make testing easier. We can call it to test a random input matrix of size NR by NC, with probability probNA that any given cell is NA.
Here are a few trivial tests:
randTest(8L,1L,1/3);
## [,1]
## [1,] NA
## [2,] 1
## [3,] 4
## [4,] 9
## [5,] NA
## [6,] 9
## [7,] 0
## [8,] 5
## user system elapsed
## 0.016 0.000 0.003
## $ri
## [1] 2 3 4 6 7 8
##
## $ci
## [1] 1
##
## [,1]
## [1,] 1
## [2,] 4
## [3,] 9
## [4,] 9
## [5,] 0
## [6,] 5
randTest(11L,3L,4/5);
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] 2 NA NA
## [5,] NA NA NA
## [6,] 5 NA NA
## [7,] 8 0 4
## [8,] NA NA NA
## [9,] NA NA NA
## [10,] NA 7 NA
## [11,] NA NA NA
## user system elapsed
## 0.297 0.000 0.300
## $ri
## [1] 4 6 7
##
## $ci
## [1] 1
##
## [,1]
## [1,] 2
## [2,] 5
## [3,] 8
randTest(10L,10L,1/3);
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 0 3 8 3 9 1 6 NA
## [2,] 1 NA NA 4 5 8 NA 8 2 NA
## [3,] 4 2 5 3 7 6 6 1 1 5
## [4,] 9 1 NA NA 4 NA NA 1 NA 9
## [5,] NA 7 NA 8 3 NA 5 3 7 7
## [6,] 9 3 1 2 7 NA NA 9 NA 7
## [7,] 0 2 NA 7 NA NA 3 8 2 6
## [8,] 5 0 1 NA 3 3 7 1 NA 6
## [9,] 5 1 9 2 2 5 NA 7 NA 8
## [10,] NA 7 1 6 2 6 9 0 NA 5
## user system elapsed
## 8.985 0.000 8.979
## $ri
## [1] 3 4 5 6 8 9 10
##
## $ci
## [1] 2 5 8 10
##
## [,1] [,2] [,3] [,4]
## [1,] 2 7 1 5
## [2,] 1 4 1 9
## [3,] 7 3 3 7
## [4,] 3 7 9 7
## [5,] 0 3 1 6
## [6,] 1 2 7 8
## [7,] 7 2 0 5
I don't know an easy way to verify that the above result is correct, but it looks good to me. However, it took almost 9 seconds to generate, so running the function on moderately larger matrices, let alone a 77x132 matrix, is probably a lost cause.
Waiting to see if someone can come up with a brilliant efficient solution...
Related
I have the following R matrix, which is a block-diagonal combination of a 2x3 and a 3x3 submatrix; there can be more than 2 submatrices, with different dimensions (e.g. m1 x p, m2 x p and m3 x p, where each of m1, m2, m3 <= p):
A2 <- list(rbind(c(1,1,1),c(-1,1,-1)),
rbind(c(-1,1,1),c(1,-1,2),c(2,-1,2)))
library(Matrix)
A2 <- as.matrix(Matrix::bdiag(A2))
Rhs <- matrix(c(0,5,0.5,4),nrow = 4)
beta <- c(rep(1.2,3),c(0.5,0.2,0.1))
> A2
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 0 0 0
[2,] -1 1 -1 0 0 0
[3,] 0 0 0 -1 1 1
[4,] 0 0 0 1 -1 2
[5,] 0 0 0 2 -1 2
I would like to get all combinations of row indices between the first and the second submatrix in order to solve a linear optimization problem. Each combination has to draw from both submatrices; I then solve for a new beta, check whether the condition Aq %*% beta == Rhs is satisfied, and if so, stop. If not, I take another combination. I think the following covers all row combinations between the submatrices:
A combination as one from the first sub-matrix and one from the second sub-matrix
Aq <- A2[c(1,3),]
Aq <- A2[c(1,4),]
Aq <- A2[c(1,5),]
Aq <- A2[c(2,3),]
Aq <- A2[c(2,4),]
Aq <- A2[c(2,5),]
Then, a combination as one from the first and 2 from the second matrix
Aq <- A2[c(1,3,4),]
Aq <- A2[c(1,3,5),]
Aq <- A2[c(1,4,5),]
Aq <- A2[c(2,3,4),]
Aq <- A2[c(2,3,5),]
Aq <- A2[c(2,4,5),]
Then, a combination as one from the first and 3 from the second matrix
Aq <- A2[c(1,3,4,5),]
Aq <- A2[c(2,3,4,5),]
Then, a combination as 2 from the first and one from the second matrix
Aq <- A2[c(1,2,3),]
Aq <- A2[c(1,2,4),]
Aq <- A2[c(1,2,5),]
Then, a combination as 2 from the first and 2 from the second matrix
Aq <- A2[c(1,2,3,4),]
Aq <- A2[c(1,2,3,5),]
Aq <- A2[c(1,2,4,5),]
Then, a combination as 2 from the first and 3 from the second matrix
Aq <- A2[c(1,2,3,4,5),]
Is there a better way to get all the combinations?
Then I would like to create a loop that picks one of the above combinations at a time and checks:
if (Aq %*% beta == Rhs) {
break
} else {
TAKE ANOTHER COMBINATION Aq
}
Please note I could have more than 2 submatrices making up the block matrix; then I would have to create all row combinations across the first, second and third matrices. I am hoping there is an easy way to do this in R. I have tried the expand.grid function but it is not giving me the desired output.
A possible base R approach:
indices1 <- 1:2
indices2 <- 3:5
apply(expand.grid(seq_along(indices1), seq_along(indices2)), 1,
function(x) t(apply(
expand.grid(combn(indices1, x[1], simplify=FALSE),
combn(indices2, x[2], simplify=FALSE)),
1, unlist)))
output:
[[1]]
Var1 Var2
[1,] 1 3
[2,] 2 3
[3,] 1 4
[4,] 2 4
[5,] 1 5
[6,] 2 5
[[2]]
Var11 Var12 Var2
[1,] 1 2 3
[2,] 1 2 4
[3,] 1 2 5
[[3]]
Var1 Var21 Var22
[1,] 1 3 4
[2,] 2 3 4
[3,] 1 3 5
[4,] 2 3 5
[5,] 1 4 5
[6,] 2 4 5
[[4]]
Var11 Var12 Var21 Var22
[1,] 1 2 3 4
[2,] 1 2 3 5
[3,] 1 2 4 5
[[5]]
Var1 Var21 Var22 Var23
[1,] 1 3 4 5
[2,] 2 3 4 5
[[6]]
Var11 Var12 Var21 Var22 Var23
[1,] 1 2 3 4 5
edit: adding a more general version:
#identifying the indices
indices <- split(seq_len(nrow(A2)), max.col(abs(A2) > 0, "first"))
#generating the combinations
apply(expand.grid(lapply(indices, seq_along)), 1L,
function(idx) {
t(apply(
expand.grid(
lapply(seq_along(idx),
function(k) {
combn(indices[[k]], idx[k], simplify=FALSE)
})),
1L, unlist))
})
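To address the looping part of the question, here is a hedged sketch (mine, not part of the answer) that walks the combinations produced above, forms Aq, and stops when the condition holds. The step that solves for a new beta is problem-specific and left as a placeholder comment; all.equal is used instead of == to avoid floating-point surprises, and the comparison is guarded to combinations whose Aq has as many rows as Rhs, since the question does not say how Rhs aligns with a smaller Aq:
## combos: the list of row-index matrices produced by the general version above
combos <- apply(expand.grid(lapply(indices, seq_along)), 1L, function(idx) {
  t(apply(expand.grid(lapply(seq_along(idx), function(k) {
    combn(indices[[k]], idx[k], simplify=FALSE)
  })), 1L, unlist))
})
result <- NULL
for (mat in combos) {
  for (r in seq_len(nrow(mat))) {
    Aq <- A2[mat[r, ], , drop = FALSE]
    ## beta <- ... # solve for the new beta given Aq here (problem-specific)
    if (nrow(Aq) == nrow(Rhs) &&
        isTRUE(all.equal(c(Aq %*% beta), c(Rhs)))) {
      result <- mat[r, ]
      break
    }
  }
  if (!is.null(result)) break
}
result # NULL if no combination satisfied the condition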
I have something like this in my dataset, and I only want to delete a row if it consists entirely of NAs, not if it has at least one value.
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
In this example they were able to delete what I want, but when I try to do it in exactly the same way, it doesn't work.
I've already tried their example:
data[rowSums(is.na(data)) != ncol(data),]
But my number of rows doesn't change to match this:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
My NAs are not characters. If I ask for their class:
class(NA)
[1] "logical"
Do you know another way to do this, please?
______UPDATE_____
Maybe I said it wrong.
My problem, and the reason their code is not working:
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
is that I have 3 columns with information, but after those, everything is NA, like this:
Date Product Code protein fat
2016-01-01 aaa 0001 NA NA
2016-01-01 bbb 0003 NA NA
2016-02-01 ccc 0032 NA NA
So the rows are not entirely NAs, only everything after the 3rd column... but I want to remove those entire rows (1:5).
Thank you!
First, I would coerce the matrix to a data frame, because this is the typical ("tidy") format to store variables and observations. Then you could use the remove_empty_rows() function from the sjmisc package:
library(sjmisc)
df <- data.frame(
a = c(1, 1, 4, NA, 4),
b = c(2, NA, 6, NA, 8),
c = c(3, 4, 7, NA, NA)
)
# get row numbers of empty rows
empty_rows(df)
## [1] 4
# remove empty rows
remove_empty_rows(df)
## A tibble: 4 × 3
## a b c
## * <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 1 NA 4
## 3 4 6 7
## 4 4 8 NA
There are also functions for columns: empty_cols() and remove_empty_cols().
If you just want to keep complete cases (rows), use complete.cases():
df[complete.cases(df), ]
## a b c
## 1 1 2 3
## 3 4 6 7
Check whether this works with the updated explanation. It subsets the data.frame to ignore the information columns when checking for NA. I added some additional rows that contain a mix of numbers and NAs.
df1 <- data.frame(Date=c("2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01", "2016-03-01"),
Product=c("aaa", "bbb", "ccc", "ddd", "eee"),
Code=c("0001", "0003", "0032", "0005", "0007"),
protein=c(NA, NA, NA, 5, NA),
fat=c(NA, NA, NA, NA, 4))
# place any columns you do not want to check for NA in names.info
names.info <- c("Date", "Product", "Code")
names.check <- setdiff(names(df1), names.info)
df1[rowSums(is.na(df1[, names.check])) != length(names.check), ]
Date Product Code protein fat
4 2016-03-01 ddd 0005 5 NA
5 2016-03-01 eee 0007 NA 4
You need to delete the as.integer.
mymat <- matrix(c(1:3, NA, 4:6, NA, rep(NA, 4)), ncol = 3)
Which translates to
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
[4,] NA NA NA
mymat[as.integer(rowSums(is.na(mymat)) != ncol(mymat)), ]
Gives you
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 1 4 NA
[3,] 1 4 NA
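The reason: as.integer() turns the logical vector c(TRUE, TRUE, TRUE, FALSE) into the positions c(1, 1, 1, 0), so the matrix is indexed positionally; each 1 selects row 1 and the 0 selects nothing:
as.integer(rowSums(is.na(mymat)) != ncol(mymat))
## [1] 1 1 1 0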
But you want
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
To get
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
In Matlab it is possible to add elements to a vector at an index far beyond its current extent; the matrix grows (zero-filled) to accommodate them. For example:
test = 1:5;
test(5,:) = 1:5;
will return:
1 2 3 4 5
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 2 3 4 5
Is there a nice, not so hacky way to do this in R?
Although joran is probably right about altering your procedure when switching to R, you could exploit the fact that indexing a vector can enlarge it too, as Frank noted in the comments. Taking advantage of the fact that matrices are vectors with a "dim" attribute and, for convenience, of the by-column storage of matrices, you could use something like:
add_col = function(x, col, value)
{
nr = NROW(x)
nc = if(col > NCOL(x)) col else NCOL(x)
i1 = nr * (col - 1) + 1
i2 = i1 + length(value) - 1
x[i1:i2] = value
length(x) = nr * nc
dim(x) = c(nr, nc)
return(x)
}
test = 1:5
add_col(test, 3, 1:3)
# [,1] [,2] [,3]
#[1,] 1 NA 1
#[2,] 2 NA 2
#[3,] 3 NA 3
#[4,] 4 NA NA
#[5,] 5 NA NA
t(add_col(add_col(test, 3, 1:3), 6, 4:1))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 2 3 4 5
#[2,] NA NA NA NA NA
#[3,] 1 2 3 NA NA
#[4,] NA NA NA NA NA
#[5,] NA NA NA NA NA
#[6,] 4 3 2 1 NA
This might be a bit hacky, but it is not much slower than just indexing a vector out of bounds:
X = seq_len(1e5)
microbenchmark::microbenchmark(add_col(X, 1e2, seq_along(X)),
t(add_col(X, 1e2, seq_along(X))), #with a transpose
"[<-"(X, (1e7 - (length(X) - 1)):1e7, seq_along(X)), #just vector out-of-bounds indexing
times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# add_col(X, 100, seq_along(X)) 34.79408 40.02492 45.61020 63.24266 78.48069 50
# t(add_col(X, 100, seq_along(X))) 79.81389 84.06544 87.57906 102.75845 110.72842 50
# `[<-`(X, (1e+07 - (length(X) - 1)):1e+07, seq_along(X)) 17.25123 18.06138 21.48956 24.69084 48.91988 50
identical(c(add_col(X, 1e2, seq_along(X))), "[<-"(X, (1e7 - (length(X) - 1)):1e7, seq_along(X)))
#[1] TRUE
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that consist entirely of NAs with the non-NA value from the k-th column of the previous row
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
This is the input data and the data returned by the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because on the previous pass through the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA), and "fill down" all numeric values. That's been done many times in SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
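Here is a minimal sketch of the two-step approach just described (my addition, not the answerer's; na.replace2 is a made-up name). It assumes, as in the example, that rows are either entirely NA or entirely non-NA, and it is essentially what the vectorized answers below arrive at:
na.replace2 <- function(x, k) {
  v <- x[, k]
  ## step 1: fill down the k-th column (last observation carried forward)
  idx <- cummax(seq_along(v) * !is.na(v)) # index of last non-NA seen so far, 0 if none yet
  idx[idx == 0] <- NA                     # leading NAs stay NA
  filled <- v[idx]
  ## step 2: rows that are entirely NA get the carried-forward value in every column
  all.na <- rowSums(is.na(x)) == ncol(x)
  x[all.na, ] <- filled[all.na]
  x
}
na.replace2(m, 2) # reproduces the desired output shown in the question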
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for a fair comparison with Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
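Applied to the example m from the question, this reproduces the desired result; in particular, the all-NA block picks up the value carried forward from column 2:
na.replace(m, 2)[5:7, ]
##      [,1] [,2]
## [1,]    3    3
## [2,]    3    3
## [3,]    3    3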
Finally I came up with my own vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}
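As a quick illustration of na.lomf on its own (example values are mine):
na.lomf(c(NA, 1, NA, NA, 3, NA))
## [1] NA  1  1  1  3  3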
Here's a workaround with the na.locf function from zoo:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
At first sight this seems a duplicate of Combine/merge columns while avoiding NA? but in fact it isn't: I sometimes deal with more than two columns, not just two.
My dataframe looks like this:
col1 col2 col3 col4 col5
[1,] 1 NA NA 13 NA
[2,] NA NA 10 NA 18
[3,] NA 7 NA 15 NA
[4,] 4 NA NA 16 NA
Now I want to "collapse" this dataframe into a dataframe with fewer columns and the NAs removed. In fact I am looking for an "Excel way of doing it": remove one cell and the whole row moves one cell to the left.
The result in this example case would be:
col1 col2
[1,] 1 13
[2,] 10 18
[3,] 7 15
[4,] 4 16
Does anyone have an idea how to do this in R? Many thanks in advance!
You can use apply for this. If df is your dataframe:
df2 <- apply(df,1,function(x) x[!is.na(x)])
df3 <- data.frame(t(df2))
colnames(df3) <- colnames(df)[1:ncol(df3)]
Output:
# col1 col2
# 1 13
# 10 18
# 7 15
# 4 16
You can use apply and na.exclude
DF
## V1 V2 V3 V4 V5
## 1 1 NA NA 13 NA
## 2 NA NA 10 NA 18
## 3 NA 7 NA 15 NA
## 4 4 NA NA 16 NA
t(apply(DF, 1, na.exclude))
## [,1] [,2]
## [1,] 1 13
## [2,] 10 18
## [3,] 7 15
## [4,] 4 16
If you want to keep the dimensions of the data.frame the same, you can use sort with na.last=TRUE instead. This will also take care of cases where you have an unequal number of values in different rows. (Note that sort also reorders the non-NA values within each row; in this example each row happens to be already in increasing order, so the result matches.)
t(apply(DF, 1, sort, na.last = T))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 13 NA NA NA
## [2,] 10 18 NA NA NA
## [3,] 7 15 NA NA NA
## [4,] 4 16 NA NA NA
This function is a bit long-winded but (1) it will be faster in the long run and (2) it offers a good amount of flexibility:
myFun <- function(inmat, outList = TRUE, fill = NA, origDim = FALSE) {
## Split up the data by row and isolate the non-NA values
myList <- lapply(sequence(nrow(inmat)), function(x) {
y <- inmat[x, ]
y[!is.na(y)]
})
## If a `list` is all that you want, the function stops here
if (isTRUE(outList)) {
myList
} else {
## If you want a matrix instead, it goes on like this
Len <- vapply(myList, length, 1L)
## The new matrix can be either just the number of columns required
## or it can have the same number of columns as the input matrix
if (isTRUE(origDim)) Ncol <- ncol(inmat) else Ncol <- max(Len)
Nrow <- nrow(inmat)
M <- matrix(fill, ncol = Ncol, nrow = Nrow)
M[cbind(rep(sequence(Nrow), Len), sequence(Len))] <-
unlist(myList, use.names=FALSE)
M
}
}
To test it out, let's create a function to make up some dummy data:
makeData <- function(nrow = 10, ncol = 5, pctNA = .8, maxval = 25) {
a <- nrow * ncol
m <- matrix(sample(maxval, a, TRUE), ncol = ncol)
m[sample(a, a * pctNA)] <- NA
m
}
set.seed(1)
m <- makeData(nrow = 5, ncol = 4, pctNA=.6)
m
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] 10 24 NA 18
# [3,] NA 17 NA 25
# [4,] NA 16 10 NA
# [5,] NA 2 NA NA
... and apply it...
myFun(m)
# [[1]]
# integer(0)
#
# [[2]]
# [1] 10 24 18
#
# [[3]]
# [1] 17 25
#
# [[4]]
# [1] 16 10
#
# [[5]]
# [1] 2
myFun(m, outList = FALSE)
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 10 24 18
# [3,] 17 25 NA
# [4,] 16 10 NA
# [5,] 2 NA NA
## Try also
## myFun(m, outList = FALSE, origDim = TRUE)
And, let's run some timings on bigger data in comparison to the other answers so far:
set.seed(1)
m <- makeData(nrow = 1e5, ncol = 5, pctNA = .75)
## Will return a matrix
funCP <- function(inmat) t(apply(inmat, 1, sort, na.last = T))
system.time(funCP(m))
# user system elapsed
# 9.776 0.000 9.757
## Will return a list in this case
funJT <- function(inmat) apply(inmat, 1, function(x) x[!is.na(x)])
system.time(JT <- funJT(m))
# user system elapsed
# 0.577 0.000 0.575
## Output a list
system.time(AM <- myFun(m))
# user system elapsed
# 0.469 0.000 0.466
identical(JT, AM)
# [1] TRUE
## Output a matrix
system.time(myFun(m, outList=FALSE, origDim=TRUE))
# user system elapsed
# 0.610 0.000 0.612
So, the list output appears slightly faster than JT85's solution, and the matrix output appears slightly slower. But, compared to using sort row-by-row, this is a definite improvement.