I would like to cbind vectors of the same dimension using a vector of their names.
For example, from
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
I would like to get the matrix produced by cbind(a, b):
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
but by looking the variables up from a vector of object names in the environment, e.g. vectornames <- c("a", "b").
My last attempt, cbind(for (i in vectornames) get(i)), failed.
You want to sapply/lapply the get function here. For example:
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
nms <- c("a", "b")
# Apply get() to each name in the nms vector
# Then convert the resulting matrix to a data frame
as.data.frame(sapply(nms, get))
a b
1 2 NA
2 5 1
3 NA 3
4 NA 4
5 6 NA
6 NA 8
Technically you can do this using cbind, but it's more awkward:
# Convert the vector of names to a list of vectors
# Then bind those vectors together as columns
do.call(cbind, lapply(nms, get))
We can use mget to 'get' a list, then "loop-unlist" with sapply and function(x) x or [ to create a matrix
sapply(mget(vectornames), \(x) x)
#OR
sapply(mget(vectornames), `[`)
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
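For completeness, a minimal sketch of an even shorter route (assuming, as above, that the named vectors live in the calling environment): mget() already returns a named list, which do.call() can bind column-wise into the same matrix.
# bind the named list returned by mget() into a matrix with columns a and b
do.call(cbind, mget(vectornames))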
I want to interpolate multiple NA values in a matrix called tester.
This is a part of tester with only one column of NA values; in the whole 744x6 matrix other columns have multiple NA values as well:
ZONEID TIMESTAMP U10 V10 U100 V100
1 20121022 12:00 -1.324032e+00 -2.017107e+00 -3.278166e+00 -5.880225574
1 20121022 13:00 -1.295168e+00 NA -3.130429e+00 -6.414975148
1 20121022 14:00 -1.285004e+00 NA -3.068829e+00 -7.101699541
1 20121022 15:00 -9.605904e-01 NA -2.332645e+00 -7.478168285
1 20121022 16:00 -6.268261e-01 -3.057278e+00 -1.440209e+00 -8.026791079
I have installed the zoo package and loaded it with library(zoo). I have tried the na.approx function, which interpolates on a linear basis, but it only returns errors:
na.approx(tester)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, rule = 2)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, x = index(tester), na.rm = TRUE, maxgap = Inf)
Afterward I tried:
Lines <- "tester"
library(zoo)
z <- read.zoo(textConnection(Lines), index = 2)[,2]
na.approx(z)
Again I got the same multiple NA values error. I also tried:
Cz <- zoo(tester)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)
Same error.
I must be doing something really stupid, but I would really appreciate your help.
You may apply na.approx only to columns with at least two non-NA values. Here I use colSums on a logical matrix to find the relevant columns.
# create a small matrix
m <- matrix(data = c(NA,  1,  1,  1, 1,
                     NA, NA,  2, NA, NA,
                     NA, NA, NA, NA, 2,
                     NA, NA, NA,  2, 3),
            ncol = 5, byrow = TRUE)
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1 1
# [2,] NA NA 2 NA NA
# [3,] NA NA NA NA 2
# [4,] NA NA NA 2 3
library(zoo)
# na.approx on the entire matrix does not work
na.approx(m)
# Error in approx(x[!na], y[!na], xout, ...) :
# need at least two non-NA values to interpolate
# find columns with at least two non-NA values
idx <- colSums(!is.na(m)) > 1
idx
# [1] FALSE FALSE TRUE TRUE TRUE
# interpolate 'TRUE columns' only
m[ , idx] <- na.approx(m[ , idx])
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1.000000 1.0
# [2,] NA NA 2 1.333333 1.5
# [3,] NA NA NA 1.666667 2.0
# [4,] NA NA NA 2.000000 3.0
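Transferred to the question, a minimal sketch could look like the following; it assumes tester keeps its numeric wind components in columns 3 to 6 (U10, V10, U100, V100) and that the first two columns hold the zone id and timestamp.
# interpolate only those numeric columns of 'tester' with at least two non-NA values
num <- as.matrix(tester[, 3:6])
ok <- colSums(!is.na(num)) > 1
num[, ok] <- na.approx(num[, ok, drop = FALSE], na.rm = FALSE)
tester[, 3:6] <- num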
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain only NAs with the value from the previous row's k-th column
na.replace <- function(x, k) {
  for (i in 2:nrow(x)) {
    if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This is input data and returned data for function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED: some revisions to improve speed, as in the comments below. Also added alternatives to the discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because in the previous pass through the loop you already "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply (see the sketch after this comment).
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
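A minimal sketch of that two-step idea, using zoo::na.locf for the fill-down rather than an explicit loop (fill_from_k is just an illustrative name):
library(zoo)
fill_from_k <- function(x, k) {
  filled_k <- na.locf(x[, k], na.rm = FALSE)  # 1) fill-down the k-th column
  all_na <- rowSums(!is.na(x)) == 0           # 2) rows that are entirely NA
  x[all_na, ] <- filled_k[all_na]
  x
}
fill_from_k(m, 2)  # reproduces the expected output from the question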
The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
  is.all.na <- rowSums(is.na(x)) == ncol(x)
  ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
  ref.idx[ref.idx == 0] <- NA
  x[is.all.na, ] <- x[ref.idx[is.all.na], k]
  x
}
And for a fair comparison with @Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
Finally I came up with my own vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Moved Forward:
# works like na.locf but is much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = FALSE) {
  idx <- which(!is.na(object))
  if (!na.rm && is.na(object[1])) idx <- c(1, idx)
  rep.int(object[idx], diff(c(idx, length(object) + 1)))
}

na.replace <- function(x, k) {
  v <- x[, k]
  i <- which(is.na(v))
  r <- na.lomf(v)
  x[i, ] <- r[i]
  x
}
Here's a workaround with the na.locf function from zoo:
# map each all-NA row to NA and every other row to its own row number,
# carry the last non-NA row number forward, then index m with that vector
# (all-NA rows thus become copies of the previous complete row)
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm = FALSE), ]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I have a matrix where some rows have NAs in all columns. I want to replace each of these all-NA rows with the value from the previous non-NA row's k-th column.
For example, this matrix:
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
Must be transformed into this matrix, where we use the 2nd column for the replacement:
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I wrote a function for this, but it uses a loop:
# replaces rows which contain all NAs with the value from the previous row's k-th column
na.replace <- function(x, k) {
  cols <- ncol(x)
  for (i in 2:nrow(x)) {
    if (sum(is.na(x[i - 1, ])) == 0 && sum(is.na(x[i, ])) == cols) {
      x[i, ] <- x[i - 1, k]
    }
  }
  x
}
This function seems to work correctly, but I want to avoid the loop. Can anyone advise how I can do this replacement without using loops?
UPDATE
agstudy suggested their own vectorized, loop-free solution:
na.replace <- function(mat, k) {
  idx <- which(rowSums(is.na(mat)) == ncol(mat))
  mat[idx, ] <- mat[ifelse(idx > 1, idx - 1, 1), k]
  mat
}
But this solution returns different (and wrong) results compared to my loop-based solution. Why does this happen? In theory the loop and non-loop solutions should be identical.
Try this function. We can replace NAs at any position in a vector.
NA.replace <- function(x) {
  # back-fill any leading NAs with the first non-NA value
  i <- cumprod(is.na(x))
  x[!!i] <- x[which.min(i)]
  # then carry the last observation forward over the remaining NAs
  if (length(x) > 0L) {
    non.na.idx <- which(!is.na(x))
    if (is.na(x[1L])) {
      non.na.idx <- c(1L, non.na.idx)
    }
    rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
  }
}
NA.replace(c(NA, 1, 2, NA, NA, 3, NA, NA, 4, NA))
# [1] 1 1 2 2 2 3 3 3 4 4
I'd use the na.locf function in a loop that simply uses the next column to generate a vector of replacement values. However, this may not be very efficient if your matrix is large.
library(zoo)
m <- cbind(
  c(NA, NA, 1, 2, NA, 4, NA, 6, 7, 8),
  c(NA, NA, 2, 3, NA, 5, NA, 7, 8, 9)
)
m[, ncol(m)] <- na.locf(m[, ncol(m)], na.rm = FALSE)
for (i in seq(ncol(m) - 1, 1)) {
  replacement_values <- na.locf(m[, i + 1], na.rm = FALSE)
  m[is.na(m[, i]), i] <- replacement_values[is.na(m[, i])]
}
EDIT: I have completely changed my first solution, which was based on na.locf.
Here is a new vectorized solution:
idx <- which(rowSums(is.na(mat)) == ncol(mat))
mat[idx, 1:2] <- mat[ifelse(idx > 1, idx - 1, 1), 2]
X..1. X..2.
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 4 5
[7,] 5 5
[8,] 6 7
[9,] 7 8
[10,] 8 9
You can wrap this in a function:
na.replace <- function(mat, k) {
  idx <- which(rowSums(is.na(mat)) == ncol(mat))
  mat[idx, ] <- mat[ifelse(idx > 1, idx - 1, 1), k]
  mat
}
Finally I worked out my own vectorized version. It returns the expected output:
na.replace <- function(x, k) {
  isNA <- is.na(x[, k])
  x[isNA, ] <- na.locf(x[, k], na.rm = FALSE)[isNA]
  x
}
UPDATE
A better solution, without any packages:
na.lomf <- function(x) {
  if (length(x) > 0L) {
    non.na.idx <- which(!is.na(x))
    if (is.na(x[1L])) {
      non.na.idx <- c(1L, non.na.idx)
    }
    rep.int(x[non.na.idx], diff(c(non.na.idx, length(x) + 1L)))
  }
}
na.lomf(c(NA, 1, 2, NA, NA, 3, NA, NA, 4, NA))
# [1] NA 1 2 2 2 3 3 3 4 4
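Putting the two pieces together, a package-free na.replace can reuse na.lomf in place of zoo's na.locf (a sketch along the lines of the earlier version):
na.replace <- function(x, k) {
  isNA <- is.na(x[, k])
  x[isNA, ] <- na.lomf(x[, k])[isNA]
  x
}
na.replace(m, 2)  # reproduces the expected output from the question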
I'd like to remove the rows that have 3 or more consecutive NAs in one column.
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[6,] 1 4
[7,] NA 8
[8,] NA 5
[9,] NA 6
so that I'd end up with this data:
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[6,] 1 4
I did some research and tried this code:
data[!rowSums(is.na(data)) > 3, ]
but I think this only counts the NAs within each row, not consecutive NAs down a column.
As mentioned, rle is a good place to start:
is.na.rle <- rle(is.na(data[, 1]))
Since NAs are "bad" only when they come in runs of three or more, we can re-write the values:
is.na.rle$values <- is.na.rle$values & is.na.rle$lengths >= 3
Finally, use inverse.rle to build the vector of indices to filter:
data[!inverse.rle(is.na.rle), ]
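Put together, and assuming the matrix from the question is rebuilt as data below, the whole filter reads:
# matrix from the question: the NA runs sit in column 1
data <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA,
                 1,  1, 4,  3, 4,  8,  5,  6), ncol = 2)
is.na.rle <- rle(is.na(data[, 1]))
is.na.rle$values <- is.na.rle$values & is.na.rle$lengths >= 3
data[!inverse.rle(is.na.rle), ]
#      [,1] [,2]
# [1,]    1    1
# [2,]   NA    1
# [3,]    2    4
# [4,]   NA    3
# [5,]    1    4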
You could use rle, or you could do this:
library(data.table)
d = data.table(a = c(1,NA,2,NA,3,4,NA,NA,NA), b = c(1:9))
d[d[, if(.N > 3) {.I[1]} else {.I}, by = cumsum(!is.na(a))]$V1]
# a b
#1: 1 1
#2: NA 2
#3: 2 3
#4: NA 4
#5: 3 5
#6: 4 6
Run d[, cumsum(!is.na(a))] to see why this works. Also, I could've used .SD instead of .I to get cleaner code, but opted for efficiency instead.
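For reference, the grouping expression evaluates to:
d[, cumsum(!is.na(a))]
# [1] 1 1 2 2 3 4 4 4 4
Group 4 is the only one with more than 3 rows (a non-NA value followed by three NAs), so it is cut back to its first row.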
As @DirkEddelbuettel suggested, the rle() function will help. You can create your own function to identify the elements of a vector with 3 or more consecutive NA values.
consecna <- function(x, n = 3) {
  # identify elements that are part of a run of n or more consecutive NA values
  y <- rle(is.na(x))
  y$values <- y$lengths > (n - 0.5) & y$values
  inverse.rle(y)
}
Then you can apply this function to each column of your matrix.
# example matrix of data
m <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, 1, 4, 3, 4, 8, 5, 6), ncol=2)
# index matrix identifying elements with 3 or more consecutive NA values
mindex <- apply(m, 2, consecna)
Then use the created index matrix to get rid of all those rows that were identified.
# removal of all the identified rows
m2 <- m[!apply(mindex, 1, any), ]