Related
I asked the same question here, but it was closed as a duplicate of similar questions, although they are unrelated to mine and do not resolve it.
The dataset:
I have a huge data set stored in a matrix with more than one million rows and about a dozen columns.
The matrix looks like
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, NA, 3, NA, 5, NA, NA, NA, 8, NA, 5, NA, 7, NA, NA, NA), ncol=3)
> data
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
[6,] NA NA NA
[7,] NA NA NA
[8,] NA NA NA
So if there is a missing value in certain column, then necessarily other columns will have missing values for the same row.
The question:
I would like to "efficiently" delete runs of 3 or more consecutive missing values in each column, for all columns in the matrix. That is, I want to delete NAs that are consecutive down a column, not across a row.
I have already seen solutions to this problem, like this one, but they were too slow for my huge data set. Do you have other suggestions that achieve this efficiently? Additionally, the suggested answers (1 & 2) to my closed question delete missing values that are consecutive across rows, not down columns.
EDIT:
Following the comment below, the output must look like this:
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
EDIT:
> data
[,1] [,2] [,3] [,4]
[1,] 1 1 8 NA
[2,] NA NA NA NA
[3,] 2 3 5 NA
[4,] NA NA NA NA
[5,] 1 5 7 NA
[6,] NA NA NA NA
[7,] NA NA NA NA
[8,] NA NA NA NA
The expected output
[,1] [,2] [,3]
[1,] 1 1 8
[2,] NA NA NA
[3,] 2 3 5
[4,] NA NA NA
[5,] 1 5 7
If the NAs are consecutive, then maybe rle can be used
i1 <- rowSums(is.na(data)) > 0
# or check only the first column (needed when another column is entirely NA)
i1 <- is.na(data[,1])
data[!inverse.rle(within.list(rle(i1), {
values[values & lengths < 3] <- FALSE})),]
-output
# [,1] [,2] [,3]
#[1,] 1 1 8
#[2,] NA NA NA
#[3,] 2 3 5
#[4,] NA NA NA
#[5,] 1 5 7
Update
If we have a particular column with all NAs, then we can remove it first
data1 <- data[,colSums(!is.na(data)) != 0]
and now we apply the previous code on the selected column data
i1 <- is.na(data1[,1])
data1[!inverse.rle(within.list(rle(i1), {
values[values & lengths < 3] <- FALSE})),]
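The two steps can also be bundled into one helper. This is only a sketch: the name drop_na_runs and its min_run argument are made up for illustration, and it relies on the question's guarantee that, once all-NA columns are removed, NAs line up across the remaining columns (and that at least one column remains).

```r
# Hypothetical helper: drop all-NA columns, then drop every row that belongs
# to a run of at least `min_run` consecutive NA rows.
drop_na_runs <- function(m, min_run = 3L) {
  # keep only columns with at least one non-NA value
  m <- m[, colSums(!is.na(m)) != 0, drop = FALSE]
  # assumption from the question: NAs are aligned across rows,
  # so inspecting the first remaining column is enough
  r <- rle(is.na(m[, 1]))
  # one flag per row: TRUE inside a long NA run
  in_long_run <- rep(r$values & r$lengths >= min_run, r$lengths)
  m[!in_long_run, , drop = FALSE]
}

data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA,
                 1, NA, 3, NA, 5, NA, NA, NA,
                 8, NA, 5, NA, 7, NA, NA, NA,
                 rep(NA, 8)), ncol = 4)
res <- drop_na_runs(data)
res
```

Using drop = FALSE keeps the result a matrix even if only one row or column survives.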
Or we may use rleid from data.table (which would be more efficient)
library(data.table)
data[as.data.table(data)[, .I[!(.N >=3 & is.na(V1))],
rleid(is.na(V1))]$V1,]
if there is a missing value in certain column, then necessarily other columns will have missing values for the same row.
I think this is very important information: we can take advantage of it and work with any single column instead of the complete dataset. Try:
vec <- data[, 1]
data[!with(rle(is.na(vec)), rep(values & lengths >= 3, lengths)), ]
# [,1] [,2] [,3]
#[1,] 1 1 8
#[2,] NA NA NA
#[3,] 2 3 5
#[4,] NA NA NA
#[5,] 1 5 7
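For readers unfamiliar with rle(), the intermediate values may help. The sketch below walks through the same steps on the first column of the example matrix; the names drop_run and drop_row are only illustrative.

```r
vec <- c(1, NA, 2, NA, 1, NA, NA, NA)   # first column of the example matrix

# run-length encoding of the NA indicator
r <- rle(is.na(vec))
r$lengths   # 1 1 1 1 1 3
r$values    # FALSE TRUE FALSE TRUE FALSE TRUE

# a run is dropped only if it is an NA run of length >= 3
drop_run <- r$values & r$lengths >= 3
# expand the per-run flag back to one flag per row
drop_row <- rep(drop_run, r$lengths)
which(drop_row)   # 6 7 8
```

Because the work happens on a single logical vector, the cost grows with the number of rows rather than with the full size of the matrix.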
I have a list with 535 elements where each of these elements is a 1575x1575 matrix.
Some of the rows and columns are however entirely NAs.
I want to remove these rows and columns and already wrote a line which works when I just apply it for one entry.
But I can't figure out how to apply this function to the whole list. covmatrix is my list in this example.
testf <- function(i){
covmatrix[[i]][apply(!is.na(covmatrix[[i]]),2,any),apply(!is.na(covmatrix[[i]]),2,any)]
}
newlist <- lapply(covmatrix, testf)
I get the error code: Error in covmatrix[[i]] : no such Index at Level 1
I guess I do not understand properly how lapply works.
Let's take the following toy example data:
matlist <- lapply(1:3, function(x) matrix(1:9, ncol = 3))
matlist[[2]][1,] <- NA
matlist[[3]][,1] <- NA
matlist
#> [[1]]
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> [[2]]
#> [,1] [,2] [,3]
#> [1,] NA NA NA
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> [[3]]
#> [,1] [,2] [,3]
#> [1,] NA 4 7
#> [2,] NA 5 8
#> [3,] NA 6 9
It makes coding a lot easier if we break down the problem into little chunks. For a complex problem, clarity of code is more important than brevity.
First we need a function that will return FALSE if all elements of a vector are NA, and TRUE otherwise:
notallNA <- function(vector) !all(is.na(vector))
Now we write a second function that uses our first function to remove rows and columns that consist purely of NAs from a matrix:
remove_NA <- function(mat) {
valid_rows <- apply(mat, 1, notallNA)
valid_cols <- apply(mat, 2, notallNA)
return(mat[valid_rows, valid_cols])
}
Finally, we can lapply this function to our list of matrices:
lapply(matlist, remove_NA)
#> [[1]]
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> [[2]]
#> [,1] [,2] [,3]
#> [1,] 2 5 8
#> [2,] 3 6 9
#>
#> [[3]]
#> [,1] [,2]
#> [1,] 4 7
#> [2,] 5 8
#> [3,] 6 9
Note that, although we could squash these two functions into one or two lines of code, and do the whole thing as a lambda inside an lapply, the above code is simpler and easier to read / maintain than:
lapply(matlist, function(x) x[apply(x, 1, function(y) !all(is.na(y))),
apply(x, 2, function(y) !all(is.na(y)))])
#> [[1]]
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
#>
#> [[2]]
#> [,1] [,2] [,3]
#> [1,] 2 5 8
#> [2,] 3 6 9
#>
#> [[3]]
#> [,1] [,2]
#> [1,] 4 7
#> [2,] 5 8
#> [3,] 6 9
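As for the error in the question: lapply() hands each list element (here, a whole matrix) to the function, not its position, so covmatrix[[i]] ends up being indexed by a matrix. A minimal sketch of the two calling styles on the toy data follows; the names keep, by_element and by_index are only illustrative.

```r
matlist <- lapply(1:3, function(x) matrix(1:9, ncol = 3))
matlist[[2]][1, ] <- NA
matlist[[3]][, 1] <- NA

keep <- function(v) !all(is.na(v))

# Style 1: the function receives the matrix itself
by_element <- lapply(matlist, function(m) {
  m[apply(m, 1, keep), apply(m, 2, keep), drop = FALSE]
})

# Style 2: iterate over positions and index the list inside the function
by_index <- lapply(seq_along(matlist), function(i) {
  m <- matlist[[i]]
  m[apply(m, 1, keep), apply(m, 2, keep), drop = FALSE]
})

identical(by_element, by_index)
```

Either style works; the first is usually preferable because the function does not depend on a list defined outside it.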
Assume that your list of matrices looks like this
set.seed(100)
ls_of_mat <- replicate(5, matrix(sample(c(NA, 1:10), size = 36, T, c(.7, rep(.3 / 10, 10))), 6), F)
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA 5 NA NA NA NA
[2,] NA NA NA NA NA 9
[3,] NA NA 4 NA 4 NA
[4,] NA NA NA 2 8 10
[5,] NA NA NA NA NA NA
[6,] NA 8 NA 7 NA 8
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA 4 NA NA NA NA
[2,] NA 6 NA NA NA NA
[3,] 1 NA NA NA 10 NA
[4,] NA NA NA NA NA NA
[5,] NA 4 NA NA NA NA
[6,] 3 8 NA NA NA NA
[[3]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA 6 NA 8 NA NA
[2,] 10 NA NA NA NA NA
[3,] NA NA 7 NA NA NA
[4,] NA NA NA NA 4 NA
[5,] 3 9 NA 8 NA 1
[6,] 4 1 7 NA NA 2
Your logic simplifies to
# 1. find non-NA elements
# 2. drop rows and cols with less than one (zero) non-NA element
lapply(ls_of_mat, function(x) {
is_value <- !is.na(x)
# note: `!` has lower precedence than `<`, so this reads as !(rowSums(is_value) < 1L)
x[!rowSums(is_value) < 1L, !colSums(is_value) < 1L]
})
Output
[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 5 NA NA NA NA
[2,] NA NA NA NA 9
[3,] NA 4 NA 4 NA
[4,] NA NA 2 8 10
[5,] 8 NA 7 NA 8
[[2]]
[,1] [,2] [,3]
[1,] NA 4 NA
[2,] NA 6 NA
[3,] 1 NA 10
[4,] NA 4 NA
[5,] 3 8 NA
[[3]]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA 6 NA 8 NA NA
[2,] 10 NA NA NA NA NA
[3,] NA NA 7 NA NA NA
[4,] NA NA NA NA 4 NA
[5,] 3 9 NA 8 NA 1
[6,] 4 1 7 NA NA 2
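One caveat with this kind of subsetting: if only a single row or column survives, [ drops the result to a plain vector unless drop = FALSE is given. A small sketch with a made-up matrix m:

```r
# made-up example: only the centre cell is non-NA
m <- matrix(c(NA, NA, NA,
              NA,  5, NA,
              NA, NA, NA), nrow = 3, byrow = TRUE)

is_value <- !is.na(m)
keep_r <- rowSums(is_value) > 0
keep_c <- colSums(is_value) > 0

dropped <- m[keep_r, keep_c]                 # collapses to a plain vector
kept    <- m[keep_r, keep_c, drop = FALSE]   # stays a 1 x 1 matrix

is.matrix(dropped)
is.matrix(kept)
dim(kept)
```

If later code expects every list element to be a matrix, adding drop = FALSE is the safer choice.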
I am trying to produce a matrix of variable dimensions of the form below (i.e. integers increasing by 1 at a time, with a lower triangle of NAs)
NA 1 2 3 4
NA NA 5 6 7
NA NA NA 8 9
NA NA NA NA 10
NA NA NA NA 11
I have used the below code
sample_vector <- c(1:(total_nodes^2))
sample_matrix <- matrix(sample_vector, nrow=total_nodes, byrow=FALSE)
sample_matrix[lower.tri(sample_matrix, diag = TRUE)] <- NA
However the matrix I get with this method is of the form:
NA 2 3 4 5
NA NA 8 9 10
NA NA NA 14 15
NA NA NA NA 20
NA NA NA NA 25
How about this
total_nodes <- 5
sample_matrix <- matrix(NA, nrow=total_nodes, ncol=total_nodes)
sample_matrix[lower.tri(sample_matrix)]<-1:sum(lower.tri(sample_matrix))
sample_matrix <- t(sample_matrix)
sample_matrix
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 2 3 4
# [2,] NA NA 5 6 7
# [3,] NA NA NA 8 9
# [4,] NA NA NA NA 10
# [5,] NA NA NA NA NA
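The same idea can be wrapped in a small function for any size. The name upper_seq is hypothetical. The trick is that lower.tri() indexes in column-major order, so filling the lower triangle and then transposing yields row-wise numbering in the upper triangle.

```r
# Hypothetical wrapper around the fill-then-transpose approach
upper_seq <- function(n) {
  m <- matrix(NA_integer_, nrow = n, ncol = n)
  idx <- lower.tri(m)
  # seq_len() rather than 1:sum(idx), so that n = 1 (empty triangle) also works
  m[idx] <- seq_len(sum(idx))
  t(m)
}

upper_seq(4)
```

For n = 4 the first row is NA 1 2 3, the second NA NA 4 5, the third NA NA NA 6, and the last row is all NA.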
I'm using the diag function to construct a matrix and upper.tri to turn it into a "target" as well as a logical indexing tool:
upr5 <- upper.tri(diag(5))
upr5
upr5[upr5] <- 1:sum(upr5)
upr5[upr5==0] <- NA # would otherwise have been zeroes
upr5
[,1] [,2] [,3] [,4] [,5]
[1,] NA 1 2 4 7
[2,] NA NA 3 5 8
[3,] NA NA NA 6 9
[4,] NA NA NA NA 10
[5,] NA NA NA NA NA
Apologies if the question is too basic. What would be an effective approach/idea (in R) to convert
list(c(1), c(1,2), c(1,2,3), c(1,2,3,4))
to square matrix form
[,1] [,2] [,3] [,4]
[1,] 1 NA NA NA
[2,] 1 2 NA NA
[3,] 1 2 3 NA
[4,] 1 2 3 4
I suppose there is some quick dynamic way to append just the right number of NA values and then convert to a matrix.
Naturally, the size of the (square) matrix can change.
Thanks in advance for your time.
You can use
## create the list
x <- Map(":", 1, 1:4)
ml <- max(lengths(x))
do.call(rbind, lapply(x, "length<-", ml))
# [,1] [,2] [,3] [,4]
# [1,] 1 NA NA NA
# [2,] 1 2 NA NA
# [3,] 1 2 3 NA
# [4,] 1 2 3 4
Or you could do
library(data.table)
as.matrix(unname(rbindlist(lapply(x, as.data.frame.list), fill = TRUE)))
# [,1] [,2] [,3] [,4]
# [1,] 1 NA NA NA
# [2,] 1 2 NA NA
# [3,] 1 2 3 NA
# [4,] 1 2 3 4
And one more for good measure ... Fore!
m <- stringi::stri_list2matrix(x, byrow = TRUE)
mode(m) <- "numeric"
m
# [,1] [,2] [,3] [,4]
# [1,] 1 NA NA NA
# [2,] 1 2 NA NA
# [3,] 1 2 3 NA
# [4,] 1 2 3 4
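Another base R sketch, not taken from the options above: indexing a vector past its end returns NA, so each element can be stretched to the longest length with `[`. It assumes at least one element is longer than 1, since otherwise sapply() returns a plain vector rather than a matrix.

```r
x <- list(1, 1:2, 1:3, 1:4)
ml <- max(lengths(x))

# x[[i]][1:ml] pads with NA beyond the element's own length;
# sapply() builds one column per element, so transpose to get rows
m <- t(sapply(x, `[`, seq_len(ml)))
m
```

Unlike the stringi route, this keeps the values numeric throughout, so no mode conversion is needed afterwards.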
I am trying to subset a matrix to always get a 3*3 matrix.
For example, the matrix being subset is a<-matrix(1:15,3,5), usually when I subset it using a[0:2,0:2], I get:
[,1] [,2]
[1,] 1 4
[2,] 2 5
But I want to get something like:
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA 1 4
[3,] NA 2 5
Force all your 0's to NAs when you select, as well as any 'out-of-bounds' values:
ro <- 0:2
co <- 0:2
a[replace(ro,ro == 0 | ro > nrow(a),NA),
replace(co,co == 0 | co > ncol(a),NA)]
# [,1] [,2] [,3]
#[1,] NA NA NA
#[2,] NA 1 4
#[3,] NA 2 5
This will even work with combinations of the parts you want missing:
ro <- c(1,0,2)
co <- 0:2
a[replace(ro,ro == 0 | ro > nrow(a),NA),
replace(co,co == 0 | co > ncol(a),NA)]
# [,1] [,2] [,3]
#[1,] NA 1 4
#[2,] NA NA NA
#[3,] NA 2 5
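This works because an NA in a numeric index vector comes back as an NA-filled row or column. A sketch that wraps the idea in a helper follows; the name sub_pad is made up here, and negative indices are not handled.

```r
# Hypothetical helper: subset `a`, returning NA rows/columns for
# index 0 and for positions beyond the matrix bounds
sub_pad <- function(a, ro, co) {
  ro <- replace(ro, ro == 0 | ro > nrow(a), NA)
  co <- replace(co, co == 0 | co > ncol(a), NA)
  a[ro, co, drop = FALSE]
}

a <- matrix(1:15, 3, 5)
# rows 0..2 (0 is padded), columns 4..6 (6 is out of bounds and padded)
res <- sub_pad(a, 0:2, 4:6)
res
```

The result is always length(ro) by length(co), so asking for three row and three column positions always gives a 3x3 matrix.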
You could create your own padding function that fills anything smaller than 3x3 with NA values
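padmatrix <- function(a, dim=c(3,3)) {
stopifnot(all(dim(a)<=dim))
# start from an all-NA matrix of the target size, so any amount of padding works
out <- matrix(NA, nrow=dim[1], ncol=dim[2])
# place `a` in the bottom-right corner; the NA padding stays at the top and left
out[seq_len(nrow(a)) + dim[1] - nrow(a), seq_len(ncol(a)) + dim[2] - ncol(a)] <- a
out
}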
padmatrix(a[1:2, 1:2])
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] NA 1 4
# [3,] NA 2 5