conditional missing value imputation

conditional missing value imputation - r

How do you impute missing values only if it is 2 or less consecutive missing values and leave other missing values as NAs using na.locf in R?
E.g.,
x<-c(2,1,NA,4,4,NA,NA,NA)
The output should be like
2,1,1,4,4,NA,NA,NA
The first NA is imputed by the previous available "1" and last 3 NAs should not be imputed.

na.locf from zoo has a 'maxgap' argument so you can simply do:
library(zoo)
na.locf(x, maxgap = 2, na.rm = FALSE)
[1] 2 1 1 4 4 NA NA NA

We can use rleid from data.table to create groups, use ave to count length of each group and use na.locf only when the the value is NA and length of the group is less than equal to 2.
library(data.table)
library(zoo)
ifelse(ave(x, rleid(x), FUN = length) <= 2 & is.na(x), na.locf(x), x)
#[1] 2 1 1 4 4 NA NA NA

Related

Subtract rows with numeric values and ignore NAs

I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).
However, I haven't managed to combine the approaches suggested in SO (i.e. apply diff over multiple columns and ignore the NAs)
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
that's how it works for a single column
diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
and that's how I apply a diff function over several columns with lapply
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd appreciate any suggestion (base, data.table, dplyr ,... solutions) on how to implement a valid !is.na or similar statement into this second line of code very much.

Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
which_nna <- which(!is.na(x))
out <- rep(NA_integer_, length(x))
out[which_nna] <- x[which_nna] - shift(x[which_nna])
out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18

So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))

Sum 2 columns, ignore NA, except when both are NA [duplicate]

This question already has answers here:
Using `:=` in data.table to sum the values of two columns in R, ignoring NAs
(2 answers)
Closed 3 years ago.
I try to sum 2 columns with some NA. There are a lot of forum questions like my first question: how to sum and ignore NA, but now I do want it to return NA when both columns have NA in a specific row. This is an example:
df<-data.table(x = c(1,2,NA),
y = c(1,NA,NA))
> df
x y
1 1
2 NA
NA NA
and I want this:
x y final
1 1 2
2 NA 2
NA NA NA
I've tried the following:
df$sum<-rowSums(df[,c("x", "y")], na.rm=TRUE)
df$final<-ifelse (is.na(df$x) && is.na(df$y) , NA,
ifelse (is.na(df$x) | is.na(df$y), df$sum,
ifelse (!is.na(df$x) && !is.na(df$y), df$sum)))
But this doesn't return what I want.. Could someone help me..?
NOTE: Some have said this is a duplicate for the reason that I ask that NA's are ignored, but those questions do not answer my main question: How should 2 x NAget me NA and not 0

I used the following. It gives sums even when there are NAs, but returns NA when all sumed elements are NA.
rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0)

Here are two more options:
ifelse(rowSums(is.na(df)) != ncol(df), rowSums(df, na.rm = TRUE), NA)
#[1] 2 2 NA
and
vals <- rowSums(df, na.rm = TRUE)
NA^(vals == 0) * vals
#[1] 2 2 NA

Limit na.locf in zoo package

I would like to do a last observation carried forward for a variable, but only up to 2 observations. That is, for gaps of data of 3 or more NA, I would only carry the last observation forward for the next 2 observations and leave the rest as NA.
If I do this with the zoo::na.locf, the maxgap parameter implies that if the gap is larger than 2, no NA is replaced. Not even the last 2. Is there any alternative?
x <- c(NA,3,4,5,6,NA,NA,NA,7,8)
zoo::na.locf(x, maxgap = 2) # Doesn't replace the first 2 NAs of after the 6 as the gap of NA is 3.
Desired_output <- c(NA,3,4,5,6,6,6,NA,7,8)

First apply na.locf0 with maxgap = 2 giving x0 and define a grouping variable g using rleid from the data.table package. For each such group use ave to apply keeper which if the group is all NA replaces it with c(1, 1, NA, ..., NA) and otherwise outputs all 1s. Multiply na.locf0(x) by that.
library(data.table)
library(zoo)
mg <- 2
x0 <- na.locf0(x, maxgap = mg)
g <- rleid(is.na(x0))
keeper <- function(x) if (all(is.na(x))) ifelse(seq_along(x) <= mg, 1, NA) else 1
na.locf0(x) * ave(x0, g, FUN = keeper)
## [1] NA 3 4 5 6 6 6 NA 7 8

A solution using base R:
ave(x, cumsum(!is.na(x)), FUN = function(i){ i[1:pmin(length(i), 3)] <- i[1]; i })
# [1] NA 3 4 5 6 6 6 NA 7 8
cumsum(!is.na(x)) groups each run of NAs with most recent non-NA value.
function(i){ i[1:pmin(length(i), 3)] <- i[1]; i } transforms the first two NAs of each group into the leading non-NA value of this group.

RowSums NA + NA gives 0 [duplicate]

This question already has answers here:
There is pmin and pmax each taking na.rm, why no psum?
(3 answers)
Closed 6 years ago.
I'll just understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple dataframe:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot use simply + because of the NA:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums the rows that have NA are calculated as 0, while if there is only one NA everything works fine:
df$d <- rowSums(df, na.rm=T)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
am I missing something?
Thanks to all

One option with rowSums would be to get the rowSums with na.rm=TRUE and multiply with the negated (!) rowSums of negated (!) logical matrix based on the NA values after converting the rows that have all NAs into NA (NA^)
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10

Because
sum(numeric(0))
# 0
Once you used na.rm = TRUE in rowSums, the second row is numeric(0). After taking sum, it is 0.
If you want to retain NA for all NA cases, it would be a two-stage work. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
if (is.data.frame(x)) x <- as.matrix(x)
z <- base::rowSums(x, na.rm = TRUE)
z[!base::rowSums(!is.na(x))] <- NA
z
}
my_rowSums(df)
# [1] 2 NA 10
This can be particularly useful, if the input x is a data frame (as in your case). base::rowSums would first check whether input is matrix or not. If it gets a data frame, it would convert it into a matrix first. Type conversion is in fact more costly than actual row sum computation. Note that we call base::rowSums two times. To reduce type conversion overhead, we should make sure x is a matrix beforehand.
For #akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10

Select last non-NA value in a row, by row

I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last true value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!

Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1

Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or select the first position if they are all NA).

Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

conditional missing value imputation - r

na.locf from zoo has a 'maxgap' argument so you can simply do: library(zoo) na.locf(x, maxgap = 2, na.rm = FALSE) [1] 2 1 1 4 4 NA NA NA

Related

Subtract rows with numeric values and ignore NAs

Sum 2 columns, ignore NA, except when both are NA [duplicate]

Limit na.locf in zoo package

RowSums NA + NA gives 0 [duplicate]

Select last non-NA value in a row, by row

Categories

Resources