Limit na.locf in zoo package - r

I would like to do a last observation carried forward for a variable, but only up to 2 observations. That is, for gaps of data of 3 or more NA, I would only carry the last observation forward for the next 2 observations and leave the rest as NA.
If I do this with zoo::na.locf, the maxgap parameter means that a gap longer than 2 is left completely untouched: no NA in it is replaced, not even the first 2. Is there any alternative?
x <- c(NA,3,4,5,6,NA,NA,NA,7,8)
zoo::na.locf(x, maxgap = 2) # Doesn't replace the first 2 NAs after the 6, because the gap of NAs has length 3.
Desired_output <- c(NA,3,4,5,6,6,6,NA,7,8)

First apply na.locf0 with maxgap = 2, giving x0, and define a grouping variable g using rleid from the data.table package. For each group, use ave to apply keeper, which returns c(1, 1, NA, ..., NA) if the group is all NA and all 1s otherwise. Multiplying na.locf0(x) by that mask keeps the carried-forward values only where the mask is 1.
library(data.table)
library(zoo)
mg <- 2
x0 <- na.locf0(x, maxgap = mg)
g <- rleid(is.na(x0))
keeper <- function(x) if (all(is.na(x))) ifelse(seq_along(x) <= mg, 1, NA) else 1
na.locf0(x) * ave(x0, g, FUN = keeper)
## [1] NA 3 4 5 6 6 6 NA 7 8

A solution using base R:
ave(x, cumsum(!is.na(x)), FUN = function(i){ i[1:pmin(length(i), 3)] <- i[1]; i })
# [1] NA 3 4 5 6 6 6 NA 7 8
cumsum(!is.na(x)) groups each run of NAs with the most recent non-NA value.
function(i){ i[1:pmin(length(i), 3)] <- i[1]; i } transforms the first two NAs of each group into the leading non-NA value of this group.
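The base R idea above can be wrapped in a small helper so that the limit is a parameter. This is only a sketch: the name locf_limit and its argument n are made up for illustration and are not part of zoo or base R.

```r
# Hypothetical helper: carry the last observation forward over at most n NAs.
locf_limit <- function(x, n = 2) {
  # Each non-NA value starts a new group; the NAs that follow share its group.
  grp <- cumsum(!is.na(x))
  ave(x, grp, FUN = function(i) {
    # Overwrite the leading value plus at most n following NAs with that value.
    k <- min(length(i), n + 1)
    i[seq_len(k)] <- i[1]
    i
  })
}

x <- c(NA, 3, 4, 5, 6, NA, NA, NA, 7, 8)
locf_limit(x, n = 2)
# values: NA 3 4 5 6 6 6 NA 7 8
```

A leading run of NAs has no earlier observation, so it forms its own group whose first element is NA and is left unchanged.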

Related

Creating New Column with Maximum Value

I've spent a good deal of time looking into this subject, but have not been able to find much. I would like a new column of data titled "Max Region", that gives the name of the column for which the maximum value occurs in each row.
df <- data.frame(Head=c(9, 6, 2, NA), Thorax=c(9, 2, NA, NA), Abdomen=c(NA, NA, 5, 5), Neck=c(4, 3, 5, 2))
# Head Thorax Abdomen Neck
# 9 9 NA 4
# 6 2 NA 3
# 2 NA 5 5
# NA NA 5 2
So far, I've used:
df$MaxRegion <- names(df)[apply(df, 1, which.max)]
However, in the case of a tie, I would really like both columns to be returned (i.e. HeadThorax or AbdomenNeck), or just "NA". Is this possible with which.max? I've also looked into max.col, but it doesn't seem to offer this either. Thank you so much!
Using the OP's code, if we need to get the column names of all tied max elements, use %in% (which returns FALSE where there are NAs) or == on the max, and paste the corresponding names
apply(df, 1, function(x) toString(names(x)[x %in% max(x, na.rm = TRUE)]))
#[1] "Head, Thorax" "Head" "Abdomen, Neck" "Abdomen"
NOTE: which.max returns only the first index of the max value
Another base R option
df$MaxRegion <- mapply(
subset,
list(names(df)),
asplit(df == do.call(pmax, c(df, na.rm = TRUE)), 1)
)
gives
> df
Head Thorax Abdomen Neck MaxRegion
1 9 9 NA 4 Head, Thorax
2 6 2 NA 3 Head
3 2 NA 5 5 Abdomen, Neck
4 NA NA 5 2 Abdomen
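The question also mentions returning NA on a tie instead of both names. A minimal sketch of that variant follows; the helper name max_region is hypothetical, and the data frame is rebuilt here so the block runs on its own.

```r
df <- data.frame(Head = c(9, 6, 2, NA), Thorax = c(9, 2, NA, NA),
                 Abdomen = c(NA, NA, 5, 5), Neck = c(4, 3, 5, 2))

# Hypothetical helper: name of the single max column, or NA if the max is tied.
max_region <- function(row) {
  # which() drops the NA comparisons, so missing cells never count as hits.
  hits <- names(row)[which(row == max(row, na.rm = TRUE))]
  if (length(hits) == 1) hits else NA_character_
}

res <- unname(apply(df, 1, max_region))
res
# values: NA, "Head", NA, "Abdomen"
```

Rows 1 and 3 contain ties, so they come back as NA; rows 2 and 4 have a unique maximum.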

Subtract rows with numeric values and ignore NAs

I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column ("Iterate over a column ignoring but retaining NA values in R") and for multiple column operations (e.g. "Summarizing multiple columns with dplyr?").
However, I haven't managed to combine the approaches suggested in SO (i.e. apply diff over multiple columns and ignore the NAs)
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
that's how it works for a single column
diff_x <- setDT(df)[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once (df must be a data.table for := to work)
and that's how I apply a diff function over several columns with lapply
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd appreciate any suggestion (base, data.table, dplyr ,... solutions) on how to implement a valid !is.na or similar statement into this second line of code very much.
Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
which_nna <- which(!is.na(x))
out <- rep(NA_integer_, length(x))
out[which_nna] <- x[which_nna] - shift(x[which_nna])
out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18
So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
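If data.table is not available, the same idea can be sketched in base R with diff() in place of shift(). The function name lag_diff_base is made up for this example.

```r
df <- data.frame(x = c(1:3, NA, NA, 9:7),
                 y = c(NA, 4:6, NA, 15:13),
                 z = c(6, 2, 7, 14, 20, NA, NA, 2))

# Hypothetical helper: difference to the previous non-NA value, NAs kept in place.
lag_diff_base <- function(x) {
  ok <- !is.na(x)
  out <- rep(NA_real_, length(x))
  # diff() is one element shorter, so pad with NA for the first observed value.
  out[ok] <- c(NA, diff(x[ok]))
  out
}

res <- as.data.frame(lapply(df, lag_diff_base))
res$x
# values: NA 1 1 NA NA 6 -1 -1
```

The result keeps the original column names; rename or cbind as needed if the differences should sit next to the raw values.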

conditional missing value imputation

How do you impute missing values only when there are 2 or fewer consecutive missing values, and leave longer runs of missing values as NA, using na.locf in R?
E.g.,
x<-c(2,1,NA,4,4,NA,NA,NA)
The output should be like
2,1,1,4,4,NA,NA,NA
The first NA is imputed by the previous available "1" and last 3 NAs should not be imputed.
na.locf from zoo has a 'maxgap' argument so you can simply do:
library(zoo)
na.locf(x, maxgap = 2, na.rm = FALSE)
[1] 2 1 1 4 4 NA NA NA
We can use rleid from data.table to create groups, use ave to count the length of each group, and use na.locf only when the value is NA and the length of the group is less than or equal to 2.
library(data.table)
library(zoo)
ifelse(ave(x, rleid(x), FUN = length) <= 2 & is.na(x), na.locf(x, na.rm = FALSE), x) # na.rm = FALSE keeps the length equal to x even with leading NAs
#[1] 2 1 1 4 4 NA NA NA
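For comparison, here is a base R sketch of the same rule (fill a run of NAs only if the whole run is short enough) using rle and cummax. The variable names, including max_gap, are chosen for illustration only.

```r
x <- c(2, 1, NA, 4, 4, NA, NA, NA)
max_gap <- 2  # assumed limit: fill runs of at most this many NAs

# Length of the run (of NAs or non-NAs) that each element belongs to.
r <- rle(is.na(x))
run_len <- rep(r$lengths, r$lengths)

# Index of the most recent non-NA value at or before each position (0 if none).
last_obs <- cummax(seq_along(x) * (!is.na(x)))

# Fill only NAs in short runs that have an earlier observation to copy.
fill <- is.na(x) & run_len <= max_gap & last_obs > 0
out <- x
out[fill] <- x[last_obs[fill]]
out
# values: 2 1 1 4 4 NA NA NA
```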

How to interpolate in R but not extrapolate to boundaries

I have a large set of data arranged with countries on one axis and years on the other, with observations of crime rate per 100k. Many countries are missing observations, so for example the crime rate for one country might be (sample data):
df <- c(NA, NA, 3, NA, 5, NA)
I can interpolate it with this code:
df_interp <- data.frame(lapply(df,
function(x) na.approx(x, rule = 2)))
But then I get: 3 3 3 4 5 5
and I would like it to become: NA NA 3 4 5 NA
I do not want values extrapolated to the boundaries, only interpolated inside of known observations.
We can use rle to get the lengths and values of adjacent elements that are equal in the logical vector (!is.na(v1)). Change the elements of logical vector values between the first and last TRUE to TRUE to create the 'ind', subset the 'v1' and apply na.approx on that vector.
library(zoo)
ind <- inverse.rle(within.list(rle(!is.na(v1)), {
i1 <- which(values)
values[min(i1):max(i1)] <- TRUE}))
v1[ind] <- na.approx(v1[ind], rule=2)
v1
#[1] NA NA 3 4 5 NA
Or we can find the index of the first and last non-NA elements with which, get the sequence (:) and use na.approx only on those elements
ind2 <- Reduce(`:`,range(which(!is.na(v1))))
v1[ind2] <- na.approx(v1[ind2], rule=2)
v1
#[1] NA NA 3 4 5 NA
data
v1 <- c(NA, NA, 3, NA, 5 , NA)
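As a cross-check, base R's approx with rule = 1 returns NA outside the range of observed points, so it interpolates interior gaps without extrapolating. The sketch below assumes at least two non-NA values, which approx needs for linear interpolation. zoo's na.approx called with na.rm = FALSE is expected to behave similarly, but that is not verified here.

```r
v1 <- c(NA, NA, 3, NA, 5, NA)

# approx() ignores the NA pairs, interpolates at every index, and with
# rule = 1 yields NA for indices before the first or after the last observation.
idx <- seq_along(v1)
interp <- approx(x = idx, y = v1, xout = idx, rule = 1)$y
interp
# values: NA NA 3 4 5 NA
```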

Modifying dplyr::lag function

I am trying to use the lag function from the dplyr package. However, when I give a lag > 0, I want the missing values to be replaced by the first value in x. How can we achieve this?
library(dplyr)
x<-c(1,2,3,4)
z<-lag(x,2)
z
## [1] NA NA 1 2
Since you are using the lag function from dplyr, there is an argument called default. So you can specify that you want x[1] to be the default.
lag(x, 2, default=x[1])
Here's a modified function mylag:
mylag <- function(x, k = 1, ...)
replace(lag(x, k, ...), seq_len(k), x[1])
x <- 1:4
mylag(x, k = 2)
# [1] 1 1 1 2
May I suggest adapting the function so that it works both ways: for lag and lead (positive AND negative lags).
shift = function(x, lag, fill=FALSE) {
require(dplyr)
switch(sign(lag)/2+1.5,
lead( x, n=abs(lag), default=switch(fill+1, NA, tail(x, 1)) ),
lag( x, n=abs(lag), default=switch(fill+1, NA, head(x, 1)) )
)
}
It has a "fill" argument that automatically fills with the first or last value, depending on the sign of the lag.
> shift(1:10, -1)
#### [1] 2 3 4 5 6 7 8 9 10 NA
> shift(1:10, +1, fill=TRUE)
#### [1] 1 1 2 3 4 5 6 7 8 9
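For readers who want the same padding behaviour without dplyr, a base R sketch is below. The name lag_fill is hypothetical, and the function only handles non-negative lags.

```r
# Hypothetical helper: lag x by k positions, padding the front with x[1].
lag_fill <- function(x, k = 1) {
  n <- length(x)
  k <- min(k, n)                      # a lag longer than x just repeats x[1]
  c(rep(x[1], k), head(x, n - k))     # k copies of x[1], then the first n - k values
}

lag_fill(c(1, 2, 3, 4), k = 2)
# values: 1 1 1 2
```

With k = 0 the input comes back unchanged, since no padding is added and head() keeps all n values.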
