I asked the same question here before, but it was closed because it was associated with similar questions, even though those are not related to my question and do not resolve it.
The dataset:
I have a huge data set saved in a matrix with more than one million rows and about a dozen columns.
The matrix looks like
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, NA, 3, NA, 5, NA, NA, NA, 8, NA, 5, NA, 7, NA, NA, NA), ncol=3)
> data
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
[6,] NA NA NA
[7,] NA NA NA
[8,] NA NA NA
So if there is a missing value in a certain column, then the other columns necessarily have missing values in the same row.
The question:
I would like to efficiently delete runs of 3 or more consecutive missing values in a column (which, given the structure above, means runs of 3 or more consecutive all-NA rows). In other words, I want to delete NAs that are consecutive down a column, not across a row.
I have already seen solutions, like this one, but they were too slow for my huge data set. Do you have other suggestions that achieve the objective efficiently? Additionally, the suggested answers (1 & 2) for my closed question delete values that are consecutive within rows, not columns.
EDIT:
Following the comment below, the output must look like this:
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
EDIT:
> data
[,1] [,2] [,3] [,4]
[1,] 1 1 8 NA
[2,] NA NA NA NA
[3,] 2 3 5 NA
[4,] NA NA NA NA
[5,] 1 5 7 NA
[6,] NA NA NA NA
[7,] NA NA NA NA
[8,] NA NA NA NA
The expected output
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
If it is consecutive, then maybe rle can be used:
i1 <- rowSums(is.na(data)) > 0
# or, since an all-NA row is NA in every column, just check the first column
i1 <- is.na(data[,1])
# keep all non-NA rows and NA runs shorter than 3; drop NA runs of length >= 3
data[!inverse.rle(within.list(rle(i1), {
  values[values & lengths < 3] <- FALSE})),]
Output:
# [,1] [,2] [,3]
#[1,] 1 1 8
#[2,] NA NA NA
#[3,] 2 3 5
#[4,] NA NA NA
#[5,] 1 5 7
Update
If a particular column is all NA, then we can remove it first:
data1 <- data[,colSums(!is.na(data)) != 0]
and now we apply the previous code to the reduced data:
i1 <- is.na(data1[,1])
data1[!inverse.rle(within.list(rle(i1), {
values[values & lengths < 3] <- FALSE})),]
Or we may use rleid from data.table (which would be more efficient)
library(data.table)
# group rows by runs of NA/non-NA in the first column, then keep the row
# indices (.I) of every group except NA runs of length 3 or more
data[as.data.table(data)[, .I[!(.N >= 3 & is.na(V1))],
                         rleid(is.na(V1))]$V1, ]
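Since efficiency is the main concern, here is a rough timing sketch of my own (it assumes the microbenchmark package is installed and uses a made-up larger matrix with the same whole-row NA pattern) comparing the base rle filter with the data.table version:
library(data.table)
library(microbenchmark)
set.seed(1)
# stand-in data: one million rows, with entire rows randomly set to NA
big <- matrix(1, nrow = 1e6, ncol = 3)
big[sample(c(TRUE, FALSE), 1e6, replace = TRUE), ] <- NA
microbenchmark(
  base_rle = {
    i1 <- is.na(big[, 1])
    big[!inverse.rle(within.list(rle(i1), {
      values[values & lengths < 3] <- FALSE})), ]
  },
  data_table = big[as.data.table(big)[, .I[!(.N >= 3 & is.na(V1))],
                                      rleid(is.na(V1))]$V1, ],
  times = 5
)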
"if there is a missing value in a certain column, then the other columns necessarily have missing values in the same row."
I think this is very important information; we can take advantage of it and work with just one column instead of the complete dataset. Try:
vec <- data[, 1]
data[!with(rle(is.na(vec)), rep(values & lengths >= 3, lengths)), ]
# [,1] [,2] [,3]
#[1,] 1 1 8
#[2,] NA NA NA
#[3,] 2 3 5
#[4,] NA NA NA
#[5,] 1 5 7
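For reuse, the same idea can be wrapped in a small helper; the function name drop_na_runs and the min_run argument are my own additions, not part of the original answer:
# drop rows belonging to a run of `min_run` or more consecutive all-NA rows,
# judged from the first column since NA rows are NA across every column
drop_na_runs <- function(m, min_run = 3) {
  r <- rle(is.na(m[, 1]))
  m[!rep(r$values & r$lengths >= min_run, r$lengths), , drop = FALSE]
}
drop_na_runs(data)   # gives the same five rows as above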
Related
If I have two square matrices with random NA values, for example:
Matrix A:
     1   2   3
1    5  NA   7
2   NA   3   8
3   NA   4   5
Matrix B:
     1   2   3
1   NA   8  NA
2    2   5   9
3   NA   4   3
What is the best way to multiply them? Would changing NA values to 0 give a different result of the dot product?
NAs are not ignored; they propagate:
## Dummy matrices
mat1 <- matrix(sample(1:9, 9), 3, 3)
mat2 <- matrix(sample(1:9, 9), 3, 3)
## Adding NAs
mat1[sample(1:9, 4)] <- NA
mat2[sample(1:9, 4)] <- NA
mat1
# [,1] [,2] [,3]
#[1,] 9 NA 3
#[2,] 2 NA NA
#[3,] NA 1 8
mat2
# [,1] [,2] [,3]
#[1,] NA NA 4
#[2,] NA 9 3
#[3,] NA 7 1
mat1 * mat2
# [,1] [,2] [,3]
#[1,] NA NA 12
#[2,] NA NA NA
#[3,] NA 7 8
mat1 %*% mat2
# [,1] [,2] [,3]
#[1,] NA NA NA
#[2,] NA NA NA
#[3,] NA NA NA
In this case the dot product results in only NAs because every sum of products involves at least one NA. Different matrices can lead to different results.
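To address the second part of the question: replacing the NAs with 0 before multiplying does change the result, because a zero simply contributes nothing to each dot product instead of turning it into NA. A minimal sketch of that workaround (my own illustration, reusing the dummy matrices above):
mat1_zero <- mat1
mat2_zero <- mat2
mat1_zero[is.na(mat1_zero)] <- 0   # treat missing entries as zero
mat2_zero[is.na(mat2_zero)] <- 0
mat1_zero %*% mat2_zero            # every entry of the product is now defined
Whether treating NA as 0 is appropriate depends on what the missing values mean in your data.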
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 7 years ago.
Thanks for taking the time to look at my problem! I'm new to the forum and relatively new to R, but I'll do my best to formulate the question clearly.
I have a big set of tree sample data with an irregular number of rows per individual. In the "class" variable column (here column 2) the first row of each individual has a value (1, 2, 3 or 4) and the subsequent ones are NA.
I'd like to assign the first value of each individual to the respective subsequent NA-cells (belonging to the same individual).
Reproducible example dataframe (edited):
test <- cbind(c(1, 2, 3, NA, 4, NA, NA, NA, 5, NA, 6), c(3, 4, 3, NA, 1, NA, NA, NA, 2, NA, 1))
colnames(test) <- c("ID", "class")
ID class
[1,] 1 3
[2,] 2 4
[3,] 3 3
[4,] NA NA
[5,] 4 1
[6,] NA NA
[7,] NA NA
[8,] NA NA
[9,] 5 2
[10,] NA NA
[11,] 6 1
The result I am looking for is this:
ID class
[1,] 1 3
[2,] 2 4
[3,] 3 3
[4,] NA 3
[5,] 4 1
[6,] NA 1
[7,] NA 1
[8,] NA 1
[9,] 5 2
[10,] NA 2
[11,] 6 1
I copied the last solution from this topic, How to substitute several NA with values within the DF using if-else in R?, and tried to adapt it to my needs like this:
test2 <- as.data.frame(t(apply(test["class"], 1, function(x)
  if(is.na(x[1]) == FALSE && all(is.na(head(x[1], -1)[-1])))
    replace(x, is.na(x), x[1]) else x)))
but it gives me the error "dim(x) must have positive length". I have tried many other versions, and they give me all sorts of errors; I don't even know where to start. How can I improve it?
Here's a little one-line function that'll work, in case you don't want to load another package:
rollForward <- function(x) {
  # collect the non-NA values (with a leading NA for positions before the first
  # one), then index them by the running count of non-NA values seen so far
  c(NA, x[!is.na(x)])[cumsum(!is.na(x)) + 1]
}
test[,"class"] <- rollForward(test[,"class"])
test
# ID class
# [1,] 1 3
# [2,] 2 4
# [3,] 3 3
# [4,] NA 3
# [5,] 4 1
# [6,] NA 1
# [7,] NA 1
# [8,] NA 1
# [9,] 5 2
# [10,] NA 2
# [11,] 6 1
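For comparison, if loading a package is acceptable, the usual packaged equivalent is na.locf() from zoo (shown as an alternative; it is not part of the original answer):
library(zoo)
# last observation carried forward; na.rm = FALSE keeps any leading NAs as NA
test[, "class"] <- na.locf(test[, "class"], na.rm = FALSE)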
I have a large matrix which comprises 1, 2, and missing (coded as NA) values. The matrix has 500,000 rows by 10,000 columns. Approximately 0.05% of the values are 1s or 2s, and the remaining values are NA.
I would like to reorder the rows and columns of the matrix so that the top left corner of the matrix comprises a relatively high number of 1s and 2s compared to the rest of the matrix. In other words, I would like to create a relatively data-rich subset of the matrix by reordering the matrix rows and columns.
Is there an efficient method of achieving this in R, perhaps using a library? I would also be interested in solutions in Python or Java, but I would prefer to perform this in R as it is the language that's most familiar to me.
I thought that there may be a set of optimisation procedures that I could use, as my working matrix is too large to do the reorganisation by eye.
I have reverted a set of edits so that the question remains consistent with the current answers.
Like this?
#some sparse data
set.seed(42)
p <- 0.0005
mat <- matrix(sample(c(1, 2, NA), 1e4, TRUE, c(p/2, p/2, 1-p)), ncol=50)
#order columns and rows by the number of NA values in them
mat <- mat[order(rowSums(is.na(mat))), order(colSums(is.na(mat)))]
#only show columns and rows containing non-NA values
mat[rowSums(!is.na(mat)) > 0, colSums(!is.na(mat)) > 0]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] NA NA NA NA 2 NA
# [2,] NA NA NA NA NA 2
# [3,] NA NA 2 NA NA NA
# [4,] NA 1 NA NA NA NA
# [5,] 1 NA NA NA NA NA
# [6,] NA NA NA 2 NA NA
Something like this?
Rgames> bar
[,1] [,2] [,3] [,4] [,5]
[1,] NA NA NA NA NA
[2,] 1 NA NA NA 3
[3,] NA NA NA NA NA
[4,] 2 NA NA NA 4
[5,] NA NA NA NA NA
Rgames> rab<-bar[order(bar[,1]),]
Rgames> rab
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA NA NA 3
[2,] 2 NA NA NA 4
[3,] NA NA NA NA NA
[4,] NA NA NA NA NA
[5,] NA NA NA NA NA
Rgames> rab[,order(rab[1,])]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 NA NA NA
[2,] 2 4 NA NA NA
[3,] NA NA NA NA NA
[4,] NA NA NA NA NA
[5,] NA NA NA NA NA
EDIT: as Roland pointed out, in a more general situation that won't come close. Now, if one is allowed to 'jumble' the rows and columns, this would do it:
for(j in 1:nrow(mat)) mat[j, ] <- mat[j, order(mat[j, ])]
for(j in 1:ncol(mat)) mat[, j] <- mat[order(mat[, j]), j]
I suspect that's not what is desired, so I'll go off and think some more about ordering "criteria".
I have a large matrix which comprises 1, 2, and missing (coded as NA) values. The matrix has 500,000 rows by 10,000 columns. Approximately 0.05% of the values are 1s or 2s, and the remaining values are NA.
I would like to reorder the rows and columns of the matrix so that the top left corner of the matrix comprises a relatively high number of 1s and 2s compared to the rest of the matrix. In other words, I would like to create a relatively data-rich subset of the matrix by reordering the matrix rows and columns.
Is there an efficient method of achieving this in R?
In particular, I'm interested in solutions where sorting by the number of non-NA values in each row and column is not sufficient to produce a dense corner.
In addition, I'll add a constraint. The size of the dense corner will be pre-defined.
In the following example, the goal is to reorder the rows and columns so that the top leftmost 3x3 submatrix is relatively dense (i.e. few or no NA values).
m1 <- matrix(c(rep(c(rep(NA, 3), rep(1, 7)), 1),
rep(c(rep(2, 3), rep(NA, 7)), 7),
rep(c(rep(NA, 3), rep(1, 7)), 2)
), nrow=10, byrow=TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA NA NA 1 1 1 1 1 1 1
[2,] 2 2 2 NA NA NA NA NA NA NA
[3,] 2 2 2 NA NA NA NA NA NA NA
[4,] 2 2 2 NA NA NA NA NA NA NA
[5,] 2 2 2 NA NA NA NA NA NA NA
[6,] 2 2 2 NA NA NA NA NA NA NA
[7,] 2 2 2 NA NA NA NA NA NA NA
[8,] 2 2 2 NA NA NA NA NA NA NA
[9,] NA NA NA 1 1 1 1 1 1 1
[10,] NA NA NA 1 1 1 1 1 1 1
The rows and columns are ordered by the number of non-NA values (using code from an answer below):
m1 <- m1[order(rowSums(is.na(m1))), order(colSums(is.na(m1)))]
However, this does not result in a dense 3x3 top leftmost corner:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA NA NA 1 1 1 1 1 1 1
[2,] NA NA NA 1 1 1 1 1 1 1
[3,] NA NA NA 1 1 1 1 1 1 1
[4,] 2 2 2 NA NA NA NA NA NA NA
[5,] 2 2 2 NA NA NA NA NA NA NA
[6,] 2 2 2 NA NA NA NA NA NA NA
[7,] 2 2 2 NA NA NA NA NA NA NA
[8,] 2 2 2 NA NA NA NA NA NA NA
[9,] 2 2 2 NA NA NA NA NA NA NA
[10,] 2 2 2 NA NA NA NA NA NA NA
I thought that there may be a set of optimisation procedures that I could implement, as my working matrix is too large to do the reorganisation by eye.
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Removing empty rows of a data file in R
How would I remove rows from a matrix or data frame where all elements in the row are NA?
So to get from this:
[,1] [,2] [,3]
[1,] 1 6 11
[2,] NA NA NA
[3,] 3 8 13
[4,] 4 NA NA
[5,] 5 10 NA
to this:
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 3 8 13
[3,] 4 NA NA
[4,] 5 10 NA
Because the problem with na.omit is that it removes rows with any NAs and so would give me this:
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 3 8 13
The best I have been able to do so far is use the apply() function:
> x[apply(x, 1, function(y) !all(is.na(y))),]
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 3 8 13
[3,] 4 NA NA
[4,] 5 10 NA
but this seems quite convoluted (is there something simpler that I am missing?).
Thanks.
Solutions using rowSums() generally outperform apply() ones:
m <- structure(c( 1, NA, 3, 4, 5,
6, NA, 8, NA, 10,
11, NA, 13, NA, NA),
.Dim = c(5L, 3L))
m[rowSums(is.na(m)) != ncol(m), ]
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 3 8 13
[3,] 4 NA NA
[4,] 5 10 NA
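A quick way to check that claim on something larger than the toy matrix (a sketch of my own; it assumes the microbenchmark package is installed):
library(microbenchmark)
set.seed(1)
big <- m[sample(nrow(m), 1e5, replace = TRUE), ]   # inflate the example to 100,000 rows
microbenchmark(
  rowSums = big[rowSums(is.na(big)) != ncol(big), ],
  apply   = big[!apply(big, 1, function(y) all(is.na(y))), ],
  times   = 10
)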
Sweep a test for all(is.na()) across rows, and remove where true. Something like this (untested as you provided no code to generate your data -- dput() is your friend):
R> ind <- apply(X, 1, function(x) all(is.na(x)))
R> X <- X[ !ind, ]