Thanks for taking the time to look at my problem! I am new to the forum and relatively new to R, but I'll do my best to formulate the question clearly.
I have a big set of tree sample data with an irregular number of rows per individual. In the "class" variable column (here column 2) the first row of each individual has a value (1, 2, 3 or 4) and the subsequent ones are NA.
I'd like to assign the first value of each individual to the respective subsequent NA-cells (belonging to the same individual).
Reproducible example dataframe (edited):
test <- cbind(c(1, 2, 3, NA, 4, NA, NA, NA, 5, NA, 6), c(3, 4, 3, NA, 1, NA, NA, NA, 2, NA, 1))
colnames(test) <- c("ID", "class")
      ID class
 [1,]  1     3
 [2,]  2     4
 [3,]  3     3
 [4,] NA    NA
 [5,]  4     1
 [6,] NA    NA
 [7,] NA    NA
 [8,] NA    NA
 [9,]  5     2
[10,] NA    NA
[11,]  6     1
The result I am looking for is this:
      ID class
 [1,]  1     3
 [2,]  2     4
 [3,]  3     3
 [4,] NA     3
 [5,]  4     1
 [6,] NA     1
 [7,] NA     1
 [8,] NA     1
 [9,]  5     2
[10,] NA     2
[11,]  6     1
I copied the last solution from the topic "How to substitute several NA with values within the DF using if-else in R?" and tried to adapt it to my needs like this:
test2 <- as.data.frame(t(apply(test["class"], 1, function(x)
if(is.na(x[1]) == FALSE && all(is.na(head(x[1], -1)[-1])))
replace(x, is.na(x), x[1]) else x)))
but it gives me the error "dim(x) must have positive length". I tried many other versions and got all sorts of errors; I don't even know where to start. How can I improve it?
Here's a little one-line function that'll work, in case you don't want to load another package:
rollForward <- function(x) {
  c(NA, x[!is.na(x)])[cumsum(!is.na(x)) + 1]
}
test[, "class"] <- rollForward(test[, "class"])
test
#       ID class
#  [1,]  1     3
#  [2,]  2     4
#  [3,]  3     3
#  [4,] NA     3
#  [5,]  4     1
#  [6,] NA     1
#  [7,] NA     1
#  [8,] NA     1
#  [9,]  5     2
# [10,] NA     2
# [11,]  6     1
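To see why the one-liner above works, it can help to print its intermediate vectors for the class column from the question. This is only an illustrative walk-through; the names x, vals, idx and filled are introduced here and are not part of the answer.

```r
# The "class" column from the question
x <- c(3, 4, 3, NA, 1, NA, NA, NA, 2, NA, 1)

# Non-NA values, with one NA prepended so that leading NAs stay NA
vals <- c(NA, x[!is.na(x)])    # NA 3 4 3 1 2 1

# Running count of non-NA values seen so far, shifted by one
# to account for the prepended NA
idx <- cumsum(!is.na(x)) + 1   # 2 3 4 4 5 5 5 5 6 6 7

# Each position picks the most recent non-NA value
filled <- vals[idx]
filled
# [1] 3 4 3 3 1 1 1 1 2 2 1
```

Because the count only increases at non-NA positions, every NA reuses the index of the last observed value before it.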
I asked the same question here, but it was closed because my post was associated with similar questions, although they are not related to my question and don't resolve it.
The dataset:
I have a huge data set saved in a matrix with more than one million rows and a dozen columns.
The matrix looks like this:
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, NA, 3, NA, 5, NA, NA, NA, 8, NA, 5, NA, 7, NA, NA, NA), ncol=3)
> data
     [,1] [,2] [,3]
[1,]    1    1    8
[2,]   NA   NA   NA
[3,]    2    3    5
[4,]   NA   NA   NA
[5,]    1    5    7
[6,]   NA   NA   NA
[7,]   NA   NA   NA
[8,]   NA   NA   NA
So if there is a missing value in a certain column, the other columns necessarily have missing values in the same row.
The question:
I would like to efficiently delete runs of consecutive missing values when a run is 3 or more rows long, for all columns in the matrix. That is, I want to delete NAs that are consecutive down a column, not across a row.
I have already seen solutions to my question, like this one, but they were too slow for my huge data set. Do you have other suggestions that achieve this efficiently? Additionally, the suggested answers (1 & 2) to my closed question delete missing values that are consecutive across rows, not down columns.
EDIT:
Following the comment below, the output must look like this:
     [,1] [,2] [,3]
[1,]    1    1    8
[2,]   NA   NA   NA
[3,]    2    3    5
[4,]   NA   NA   NA
[5,]    1    5    7
EDIT:
> data
     [,1] [,2] [,3] [,4]
[1,]    1    1    8   NA
[2,]   NA   NA   NA   NA
[3,]    2    3    5   NA
[4,]   NA   NA   NA   NA
[5,]    1    5    7   NA
[6,]   NA   NA   NA   NA
[7,]   NA   NA   NA   NA
[8,]   NA   NA   NA   NA
The expected output:
     [,1] [,2] [,3]
[1,]    1    1    8
[2,]   NA   NA   NA
[3,]    2    3    5
[4,]   NA   NA   NA
[5,]    1    5    7
If the NAs are consecutive, then maybe rle can be used:
i1 <- rowSums(is.na(data)) > 0
# or, because NAs span whole rows here, test only the first column
i1 <- is.na(data[, 1])
data[!inverse.rle(within.list(rle(i1), {
    values[values & lengths < 3] <- FALSE})), ]
-output
#     [,1] [,2] [,3]
#[1,]    1    1    8
#[2,]   NA   NA   NA
#[3,]    2    3    5
#[4,]   NA   NA   NA
#[5,]    1    5    7
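The rle approach above can be unpacked step by step. The following is a sketch on the three-column matrix from the question; the names runs, drop_row and result are made up for illustration, and the run values are overwritten directly instead of through within.list.

```r
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA,
                 1, NA, 3, NA, 5, NA, NA, NA,
                 8, NA, 5, NA, 7, NA, NA, NA), ncol = 3)

i1 <- is.na(data[, 1])
runs <- rle(i1)
runs$lengths   # 1 1 1 1 1 3
runs$values    # FALSE TRUE FALSE TRUE FALSE TRUE

# Keep TRUE only for NA runs that are at least 3 rows long
runs$values <- runs$values & runs$lengths >= 3

# Expand the runs back to one flag per row
drop_row <- inverse.rle(runs)   # TRUE for rows 6, 7 and 8 only

result <- data[!drop_row, ]
```

Since rle and inverse.rle work on a single logical vector, the cost depends on the number of rows and not on the number of columns.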
Update
If a particular column is all NA, we can remove it first:
data1 <- data[, colSums(!is.na(data)) != 0]
and then apply the previous code to the remaining columns:
i1 <- is.na(data1[, 1])
data1[!inverse.rle(within.list(rle(i1), {
    values[values & lengths < 3] <- FALSE})), ]
Or we may use rleid from data.table (which would be more efficient):
library(data.table)
data[as.data.table(data)[, .I[!(.N >= 3 & is.na(V1))],
    rleid(is.na(V1))]$V1, ]
"if there is a missing value in a certain column, then necessarily other columns will have missing values for the same row."
I think this is very important information. We can take advantage of it and work with any single column instead of the complete dataset. Try:
vec <- data[, 1]
data[!with(rle(is.na(vec)), rep(values & lengths >= 3, lengths)), ]
#     [,1] [,2] [,3]
#[1,]    1    1    8
#[2,]   NA   NA   NA
#[3,]    2    3    5
#[4,]   NA   NA   NA
#[5,]    1    5    7
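Working on one column is only safe if every row really is either complete or entirely NA. A minimal check of that assumption, using the three-column matrix from the question (the names na_per_row and rows_consistent are introduced here for illustration), could look like this:

```r
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA,
                 1, NA, 3, NA, 5, NA, NA, NA,
                 8, NA, 5, NA, 7, NA, NA, NA), ncol = 3)

# Number of missing values in each row
na_per_row <- rowSums(is.na(data))

# TRUE when every row has either no NAs or only NAs
rows_consistent <- all(na_per_row %in% c(0, ncol(data)))
rows_consistent
# [1] TRUE
```

For the four-column variant in the question's edit, this check would fail until the all-NA column has been dropped, because the complete rows then contain exactly one NA.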
I want to write code that finds the length of the longest continuous stretch of NA values in a column of a data-frame object.
>> df
     [,1] [,2]
[1,]    1    1
[2,]   NA    1
[3,]    2    4
[4,]   NA   NA
[5,]    1   NA
[6,]   NA    8
[7,]   NA   NA
[8,]   NA    6
# e.g.
>> longestNAstrech(df[,1])
>> 3
>> longestNAstrech(df[,2])
>> 2
# What should be the length of longestNAstrech()?
Using base R we could create a function
longestNAstrech <- function(x) {
  with(rle(is.na(x)), max(lengths[values]))
}
longestNAstrech(df[, 1])
#[1] 3
longestNAstrech(df[, 2])
#[1] 2
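One edge case is worth noting: if a column contains no NA at all, lengths[values] is empty and max() returns -Inf with a warning. A possible variant that returns 0 in that case is sketched below; the name longest_na_run is hypothetical and not from the answer above.

```r
longest_na_run <- function(x) {
  r <- rle(is.na(x))
  # Lengths of the runs that consist of NA values
  na_lengths <- r$lengths[r$values]
  # No NA run at all: report 0 instead of -Inf
  if (length(na_lengths) == 0) 0L else max(na_lengths)
}

longest_na_run(c(1, NA, 2, NA, 1, NA, NA, NA))  # 3
longest_na_run(c(1, 1, 4, NA, NA, 8, NA, 6))    # 2
longest_na_run(c(1, 2, 3))                      # 0
```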
I have a large data matrix (“trial.matrix”) similar to the one below.
     [,1] [,2]
[1,]    3   NA
[2,]    5   NA
[3,]    7   NA
[4,]    9   10
[5,]   11   12
[6,]   13   14
My problem requires that I shuffle some rows of the differenced version of this matrix and then reconstruct a matrix from the shuffled difference matrix. When I apply diff(trial.matrix) I get:
     [,1] [,2]
[1,]    2   NA
[2,]    2   NA
[3,]    2   NA
[4,]    2    2
[5,]    2    2
To reconstruct the original data frame, I need to use cumsum() or diffinv(), e.g.:
new.df <- diffinv(diff(trial.matrix), xi = t(c(3, 10)))
but this gives:
     [,1] [,2]
[1,]    3   10
[2,]    5   NA
[3,]    7   NA
[4,]    9   NA
[5,]   11   NA
[6,]   13   NA
Obviously, the beginning value (“xi”) for column 2 has to be applied starting in row 3 (or 4?). I have a number of columns in the real matrix, some with leading NAs and some without. I need to preserve the leading NAs in the reconstruction. I cannot figure out an easy way to reconstruct the columns with NAs in the difference matrix in a straightforward way.
(For each column I am able to construct two vectors, one containing the first non-NA row, and the other containing the first NA value, but cannot figure out a straightforward way to use these.)
Suggestions appreciated.
You can temporarily replace NAs by zeroes:
trial.matrix <- matrix(c(seq(3, 13, by = 2), rep(NA, 3), 10, 12, 14), ncol = 2)
# first non-NA value of each column, used as the starting value
xi <- apply(trial.matrix, 2, function(cl) cl[which(!is.na(cl))[1]])
z2 <- diff(trial.matrix)
# temporarily replace NAs in the second column by zeroes:
nas <- which(is.na(z2[, 2]))
z2[nas, 2] <- 0
new.df <- diffinv(z2, xi = t(xi))
# put the NAs back
new.df[nas, 2] <- NA
new.df
#      [,1] [,2]
# [1,]    3   NA
# [2,]    5   NA
# [3,]    7   NA
# [4,]    9   10
# [5,]   11   12
# [6,]   13   14
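The question mentions several columns, some with leading NAs and some without. The same zero-replacement idea can be applied to all columns at once, as in the sketch below. It assumes that NAs occur only as a leading block in each column and that the NA positions are taken from the original matrix; the names na_mask, z and rebuilt are introduced here for illustration.

```r
trial.matrix <- matrix(c(seq(3, 13, by = 2), rep(NA, 3), 10, 12, 14), ncol = 2)

# First non-NA value of each column, used as the starting value
xi <- apply(trial.matrix, 2, function(cl) cl[which(!is.na(cl))[1]])

# Remember where the original matrix had NAs
na_mask <- is.na(trial.matrix)

# Difference the matrix and zero out every NA difference, in any column
z <- diff(trial.matrix)
z[is.na(z)] <- 0

# Integrate, then restore the NAs at their original positions
rebuilt <- diffinv(z, xi = t(xi))
rebuilt[na_mask] <- NA

all.equal(rebuilt, trial.matrix)
# [1] TRUE
```

The zeroed differences keep the cumulative sum flat at the starting value until the first real difference, so the non-NA part of each column is rebuilt from the correct origin.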
I have a data frame in which some rows have NA entries, and I want to find the row and column index of each NA entry. I am currently using nested loops, which takes too long. Is there a quicker way to do it? Thanks.
set.seed(123)
dfrm <- data.frame(a = sample(c(1:5, NA), 25, TRUE), b = sample(c(letters, NA), 25, replace = TRUE))
which(is.na(dfrm), arr.ind=TRUE)
     row col
[1,]   4   1
[2,]   5   1
[3,]   8   1
[4,]  11   1
[5,]  16   1
[6,]  20   1
[7,]  21   1
[8,]   6   2
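Because the sampled data above depends on the random number generator, a small fixed example may be easier to follow. This sketch uses a made-up three-row data frame; na_pos is a name introduced here.

```r
dfrm <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# is.na() on a data frame gives a logical matrix;
# arr.ind = TRUE returns one (row, col) pair per TRUE entry
na_pos <- which(is.na(dfrm), arr.ind = TRUE)

na_pos[, "row"]               # 2 3
na_pos[, "col"]               # 1 2
names(dfrm)[na_pos[, "col"]]  # "a" "b"
```

The result is ordered column by column, which is why all entries for the first column come before those for the second.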
How would I remove rows from a matrix or data frame where all elements in the row are NA?
So to get from this:
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]   NA   NA   NA
[3,]    3    8   13
[4,]    4   NA   NA
[5,]    5   10   NA
to this:
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    3    8   13
[3,]    4   NA   NA
[4,]    5   10   NA
The problem with na.omit is that it removes rows with any NAs, and so would give me this:
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    3    8   13
The best I have been able to do so far is use the apply() function:
> x[apply(x, 1, function(y) !all(is.na(y))),]
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    3    8   13
[3,]    4   NA   NA
[4,]    5   10   NA
but this seems quite convoluted (is there something simpler that I am missing?)....
Thanks.
Solutions using rowSums() generally outperform apply() ones:
m <- structure(c( 1, NA,  3,  4,  5,
                  6, NA,  8, NA, 10,
                 11, NA, 13, NA, NA),
               .Dim = c(5L, 3L))
m[rowSums(is.na(m)) != ncol(m), ]
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    3    8   13
[3,]    4   NA   NA
[4,]    5   10   NA
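One thing to watch with this kind of row subsetting: if only a single row survives, R drops the result to a plain vector unless drop = FALSE is given. A small sketch with a made-up 3 x 2 matrix (the name keep is introduced here):

```r
m <- matrix(c(1, NA, NA,
              6, NA, NA), ncol = 2)

# TRUE for rows that are not entirely NA
keep <- rowSums(is.na(m)) != ncol(m)

m[keep, ]                 # a plain vector: 1 6
m[keep, , drop = FALSE]   # still a 1 x 2 matrix
```

Adding drop = FALSE keeps later code that expects a matrix, such as nrow() or further row subsetting, working regardless of how many rows remain.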
Sweep a test for all(is.na()) across rows, and remove where true. Something like this (untested as you provided no code to generate your data -- dput() is your friend):
R> ind <- apply(X, 1, function(x) all(is.na(x)))
R> X <- X[ !ind, ]