I'd like to remove the rows that belong to a run of 3 or more consecutive NAs in one column.
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[6,] 1 4
[7,] NA 8
[8,] NA 5
[9,] NA 6
so I'd have this data
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[6,] 1 4
I did some research and tried this code:
data[! rowSums(is.na(data)) >3 , ]
but I think this only counts the NAs within each row, not consecutive NAs down a column.
As mentioned, rle is a good place to start:
is.na.rle <- rle(is.na(data[, 1]))
Since NAs are "bad" only when they come by three or more, we can re-write the values:
is.na.rle$values <- is.na.rle$values & is.na.rle$lengths >= 3
Finally, use inverse.rle to build the vector of indices to filter:
data[!inverse.rle(is.na.rle), ]
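For completeness, the three steps above can be run end to end on the question's matrix. This is only a minimal sketch: the names keep and result are introduced here for illustration, and drop = FALSE is added on the assumption that the result should stay a matrix even if a single row survives.

```r
# Rebuild the question's two-column matrix (8 rows)
data <- cbind(c(1, NA, 2, NA, 1, NA, NA, NA),
              c(1, 1, 4, 3, 4, 8, 5, 6))

# Run-length encode the NA pattern of column 1
is.na.rle <- rle(is.na(data[, 1]))

# Flag only the NA runs of length 3 or more
is.na.rle$values <- is.na.rle$values & is.na.rle$lengths >= 3

# Expand back to one logical per row and keep the unflagged rows
keep <- !inverse.rle(is.na.rle)
result <- data[keep, , drop = FALSE]
```

Here the isolated NAs in rows 2 and 4 are kept, and only the final run of three is dropped.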
You could use rle, or you could do this:
library(data.table)
d = data.table(a = c(1,NA,2,NA,3,4,NA,NA,NA), b = c(1:9))
d[d[, if(.N > 3) {.I[1]} else {.I}, by = cumsum(!is.na(a))]$V1]
# a b
#1: 1 1
#2: NA 2
#3: 2 3
#4: NA 4
#5: 3 5
#6: 4 6
Run d[, cumsum(!is.na(a))] to see why this works. Also, I could've used .SD instead of .I to get cleaner code, but opted for efficiency instead.
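To see the grouping idea without data.table, here is a rough base R sketch of the same logic. The helper names grp, grp_size, first_in_grp and keep are made up for this illustration and are not part of the answer above.

```r
a <- c(1, NA, 2, NA, 3, 4, NA, NA, NA)

# Each non-NA value starts a new group; the NAs that follow join that group
grp <- cumsum(!is.na(a))

# Size of the group each row belongs to
grp_size <- ave(seq_along(a), grp, FUN = length)

# Keep small groups whole; for groups of more than 3 rows
# (one value followed by 3 or more NAs) keep only the first row
first_in_grp <- !duplicated(grp)
keep <- grp_size <= 3 | first_in_grp

which(keep)
```

Like the data.table version, this assumes the column does not start with a long run of NAs, since leading NAs all fall into group 0.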
As @DirkEddelbuettel suggested, the rle() function will help. You can create your own function to identify the elements of a vector with 3 or more consecutive NA values.
consecna <- function(x, n=3) {
# function to identify elements with n or more consecutive NA values
y <- rle(is.na(x))
y$values <- y$lengths > (n - 0.5) & y$values
inverse.rle(y)
}
Then you can apply this function to each column of your matrix.
# example matrix of data
m <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, 1, 4, 3, 4, 8, 5, 6), ncol=2)
# index matrix identifying elements with 3 or more consecutive NA values
mindex <- apply(m, 2, consecna)
Then use the created index matrix to get rid of all those rows that were identified.
# removal of all the identified rows
m2 <- m[!apply(mindex, 1, any), ]
Related
I would like to cbind vectors of the same length using a vector of their names.
For example, from
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
I would like to get the matrix produced by cbind(a, b):
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
but by looking the variables up from a vector of object names, e.g. vectornames <- c("a","b")
My last attempt, cbind(for(i in vectornames) get(i)), failed.
You want to sapply/lapply the get function here. For example:
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
nms <- c("a", "b")
# Apply get() to each name in the nmes vector
# Then convert the resulting matrix to a data frame
as.data.frame(sapply(nms, get))
a b
1 2 NA
2 5 1
3 NA 3
4 NA 4
5 6 NA
6 NA 8
Technically you can do this using cbind, but it's more awkward:
# Convert the vector of names to a list of vectors
# Then bind those vectors together as columns
do.call(cbind, lapply(nms, get))
We can use mget to 'get' a named list of the vectors, then "loop-unlist" it with sapply and either function(x) x or `[` to create a matrix:
sapply(mget(vectornames), \(x) x)
#OR
sapply(mget(vectornames), `[`)
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
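As a possible variation, the named list returned by mget can also be handed straight to cbind through do.call, which keeps the column names. This is a sketch only; vec_list and mat are names chosen here for illustration.

```r
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
vectornames <- c("a", "b")

# mget() returns a named list of the objects found under those names
vec_list <- mget(vectornames)

# cbind() uses the list names as column names
mat <- do.call(cbind, vec_list)
```

Unlike sapply, this form does not try to simplify, so it fails loudly if the vectors cannot be bound together rather than silently returning a list.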
I have a numerical sequence containing NAs. There are at least two consecutive values that are not NA, like this:
x
# [1] 2 4 NA NA 1 NA 1 2 NA NA
I am trying to compute y like this:
y[1] = 1, y[2] = y[1] * (x[2] / x[1])
When there is an NA:
y[i] = y[i-1]
g<-function(v){
t=v
for(i in 2:(length(v)-1))
{ if(!(is.na(v[i])))
v[i+1]=v[i]*(t[i+1]/t[i])
else v[i]=v[i-1]
}
}
This function does not work. Besides, when I try to apply it to a large matrix by column, it is too slow because g uses loops.
Do you have a better way to do this?
Since you want y to remain unchanged when the ratio of the xs is missing, you can calculate the ratio of the xs, then fill in the missing values with 1 and keep right on multiplying using the function cumprod(). Without a for loop, I'm guessing this would be faster than the function you were using.
myfun <- function(x) {
# calculate the ratio of x[i] to x[i-1]
xrat <- x[-1]/x[-length(x)]
# when the ratio is missing, define it as 1
xrat[is.na(xrat)] <- 1
# define y
y <- c(1, cumprod(xrat))
return(y)
}
# your x
x <- c(2, 4, NA, NA, 1, NA, 1, 2, NA, NA)
y <- myfun(x)
The results look like this
cbind(x, y)
x y
[1,] 2 1
[2,] 4 2
[3,] NA 2
[4,] NA 2
[5,] 1 2
[6,] NA 2
[7,] 1 2
[8,] 2 4
[9,] NA 4
[10,] NA 4
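Since the question also asks about running this over the columns of a matrix, one way to do that might be apply() with MARGIN = 2. The second column below is made-up data for illustration; the first is the x from the question.

```r
myfun <- function(x) {
  xrat <- x[-1] / x[-length(x)]  # ratio of x[i] to x[i-1]
  xrat[is.na(xrat)] <- 1         # a missing ratio leaves y unchanged
  c(1, cumprod(xrat))
}

# Hypothetical two-column matrix
mat <- cbind(c(2, 4, NA, NA, 1, NA, 1, 2, NA, NA),
             c(1, 3, 6, NA, NA, 2, 1, NA, 5, 10))

# One call of myfun per column; the result has the same shape as mat
ymat <- apply(mat, 2, myfun)
```

apply() still loops over columns internally, but each column is handled by vectorized code, so the inner element-by-element loop is gone.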
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain only NAs with the last non-NA value from the K-th column of the previous row
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
This is input data and returned data for function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED Some revisions to improve speed as in comments below. Also added alternatives in discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is T or F because in the previous time thru the loop you "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for a fair comparison with @Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally, I implemented my own version of a vectorized solution and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = FALSE) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}
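A quick way to check what na.lomf does with leading NAs is to run it on a small vector. The input v is a made-up example, and the function body repeats the same logic as the definition above so that the snippet runs on its own.

```r
# Same logic as the definition above
na.lomf <- function(object, na.rm = FALSE) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}

v <- c(NA, 1, NA, NA, 3, NA)  # hypothetical input

kept    <- na.lomf(v)                # leading NA is kept
dropped <- na.lomf(v, na.rm = TRUE)  # leading NA is dropped, so the result is shorter
```

Note that with na.rm = TRUE the output no longer lines up with the input positions, which is why na.replace calls it with the default.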
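Here's a workaround with the na.locf function from zoo. Note that it repeats the whole previous non-NA row rather than the k-th column value, so rows 5 to 7 come out as 2 3 rather than 3 3: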
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I have a vector with numerical and NA elements.
For example,
data<-c(.4, -1, 1, NA, 8, NA, -.4)
data[complete.cases(data), ]
But what's the function to separate them into different vectors so I can compare them using graphs such as a boxplot and ECDF?
It's not clear what problem you are trying to solve. complete.cases creates a logical vector for selection (if you use it correctly.) You can negate it to get the other ones. You cannot address a vector as you attempted with [ , ] but if 'data' were a dataframe (or a matrix) that would have worked.
data<-c(.4, -1, 1, NA, 8, NA, -.4)
data[complete.cases(data) ]
#[1] 0.4 -1.0 1.0 8.0 -0.4
data[!complete.cases(data) ]
#[1] NA NA
If one is trying to select the non-NA items it might be easier to use !is.na(data) as the selection vector. This is a test case showing it works with matrices as well as data.frames:
> dat <- matrix( sample(c(1,2,NA), 12, rep=TRUE), 3)
> dat
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] NA NA 2 2
[3,] 1 NA 2 1
> dat[ complete.cases(dat), ]
[1] 1 1 1 1
> dat[ ! complete.cases(dat), ]
[,1] [,2] [,3] [,4]
[1,] NA NA 2 2
[2,] 1 NA 2 1
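Applied to the vector from the question, the !is.na() selection mentioned above might look like the following sketch. The names observed, n_missing and Fn are introduced here for illustration; ecdf() and boxplot() are the base R functions for the plots the question asks about.

```r
data <- c(.4, -1, 1, NA, 8, NA, -.4)

# Non-missing values only
observed <- data[!is.na(data)]

# How many values were missing
n_missing <- sum(is.na(data))

# Empirical CDF of the observed values; plot(Fn) or boxplot(observed) would draw them
Fn <- ecdf(observed)
Fn(1)  # proportion of observed values <= 1
```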
Suppose a matrix has some missing data recorded as NA.
How can I delete the rows that contain NA?
Can I use na.rm?
na.omit() will take matrices (and data frames) and return only those rows with no NA values whatsoever - it takes complete.cases() one step further by deleting the FALSE rows for you.
> x <- data.frame(c(1,2,3), c(4, NA, 6))
> x
c.1..2..3. c.4..NA..6.
1 1 4
2 2 NA
3 3 6
> na.omit(x)
c.1..2..3. c.4..NA..6.
1 1 4
3 3 6
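Since the question is about a matrix, here is a small sketch of the same call on a matrix instead of a data frame. The matrix is a made-up example; the "na.action" attribute is where na.omit() records which rows it removed.

```r
x <- matrix(c(1:8, NA), 3, 3)

# Rows containing any NA are dropped; the result is still a matrix
y <- na.omit(x)

# Row numbers that were removed
removed <- attr(y, "na.action")
```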
I think na.rm is usually only an argument to functions, for example mean. I would go with complete.cases: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.htm
Let's say you have the following 3x3 matrix:
x <- matrix(c(1:8, NA), 3, 3)
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 NA
then you can get the complete cases of this matrix with
y <- x[complete.cases(x),]
> y
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
The complete.cases-function returns a vector of truth values that says whether or not a case is complete:
> complete.cases(x)
[1] TRUE TRUE FALSE
and then you index the rows of matrix x and add the "," to say that you want all columns.
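One thing to watch for with this indexing: if only a single row is complete, R drops the result to a plain vector unless drop = FALSE is given. A minimal sketch with a made-up matrix:

```r
# Only row 1 is complete in this hypothetical matrix
x <- matrix(c(1, NA, NA, 4, 5, NA), nrow = 3)

as_vector <- x[complete.cases(x), ]                # dimensions are dropped
as_matrix <- x[complete.cases(x), , drop = FALSE]  # stays a 1 x 2 matrix
```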
If you want to remove rows that contain NA's you can use apply() to apply a quick function to check each row. E.g., if your matrix is x,
goodIdx <- apply(x, 1, function(r) !any(is.na(r)))
newX <- x[goodIdx,]
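A possible alternative to the apply() call, assuming x is a matrix, is to count the NAs per row with rowSums(), which avoids calling an R function once per row. This is a sketch, not a benchmarked recommendation.

```r
x <- matrix(c(1:8, NA), 3, 3)

# A row is good when it contains zero NAs
goodIdx <- rowSums(is.na(x)) == 0
newX <- x[goodIdx, , drop = FALSE]
```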