How to search upwards a column for a value based on whether another column is NA or not? - r

I need to find the previous date for which value is not NA, and also use the value from that row. I tried shift, but it works for row 9 and not when there are consecutive non-NA entries in type, such as rows 5 and 6.
dtihave = data.table(date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02", "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23", "2020-09-09")),
type = c(1,1,NA,NA,1,1,NA,NA,1),
value = c(7,NA,6,8,NA,NA,5,9,NA))
> dtihave
date type value
1: 2020-01-01 1 7
2: 2020-02-01 1 NA
3: 2020-03-03 NA 6
4: 2020-04-02 NA 8
5: 2020-05-09 1 NA
6: 2020-06-10 1 NA
7: 2020-07-18 NA 5
8: 2020-08-23 NA 9
9: 2020-09-09 1 NA
dtiwant = data.table(date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02", "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23", "2020-09-09")),
type = c(1,1,NA,NA,1,1,NA,NA,1),
value = c(7,NA,6,8,NA,NA,5,9,NA),
iwantdate = c(NA, as.Date("2020-01-01"), NA, NA, as.Date("2020-04-02"), as.Date("2020-04-02"), NA, NA, as.Date("2020-08-23")),
iwantvalue = c(NA,7,NA,NA,8,8,NA,NA,9))
dtiwant[, iwantdate := as.Date(iwantdate, origin = "1970-01-01")]
> dtiwant
date type value iwantdate iwantvalue
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-04-02 8
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9
Here is my current progress using shift. Row 6 should have iwantdate = "2020-04-02", but the number of rows to shift back is not known in advance, so I cannot simply use n = 2 in shift.
dtprogress = copy(dtihave)
dtprogress[, iwantdate := ifelse(!is.na(type) & is.na(value), shift(date), NA)]
dtprogress[, iwantdate := ifelse(!is.na(type) & !is.na(value), date, iwantdate)]
dtprogress[, iwantdate := as.Date(iwantdate, origin = "1970-01-01")]
> dtprogress
date type value iwantdate
1: 2020-01-01 1 7 2020-01-01
2: 2020-02-01 1 NA 2020-01-01
3: 2020-03-03 NA 6 <NA>
4: 2020-04-02 NA 8 <NA>
5: 2020-05-09 1 NA 2020-04-02
6: 2020-06-10 1 NA 2020-05-09
7: 2020-07-18 NA 5 <NA>
8: 2020-08-23 NA 9 <NA>
9: 2020-09-09 1 NA 2020-08-23

You could do:
dtihave[, idx := cummax((!is.na(value)) * .I) * NA^!is.na(value)][,
c('want_date', 'want_value') := lapply(.SD, '[', idx),
.SDcols = c('date', 'value')][, idx:=NULL]
dtihave
date type value want_date want_value
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-04-02 8
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9
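The index trick in the answer above can be unpacked in base R. This is only a sketch on plain vectors copied from the question, not the answerer's exact code: it replaces the NA^ multiplication with an explicit ifelse, on the assumption that a column might start with NA. In that case the running maximum is 0, and a zero index is silently dropped by `[`, which would shorten the result.

```r
# Toy vectors copied from the question's dtihave table.
date <- as.Date(c("2020-01-01", "2020-02-01", "2020-03-03", "2020-04-02",
                  "2020-05-09", "2020-06-10", "2020-07-18", "2020-08-23",
                  "2020-09-09"))
value <- c(7, NA, 6, 8, NA, NA, 5, 9, NA)

has_value <- !is.na(value)

# Row number of the most recent non-NA value at or before each row
# (0 when no such row exists yet).
last_idx <- cummax(has_value * seq_along(value))

# Keep the index only on rows whose own value is NA and that have a donor row.
idx <- ifelse(has_value | last_idx == 0L, NA_integer_, last_idx)

# Indexing with NA yields NA, so the untouched rows stay NA.
want_date <- date[idx]
want_value <- value[idx]
data.frame(date, value, want_date, want_value)
```

Because the lookup is a single vectorised index, the same idx can be reused for any number of columns.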
With tidyverse (nafill comes from data.table). This should also work on grouped data: just add %>% group_by(...) before mutate.
dtihave %>%
mutate(val_na = !is.na(value),
idx = nafill(na_if(row_number() * val_na, 0), "locf"),
idx = idx * NA ^ val_na,
date1 = date[idx], value1 = value[idx],
val_na = NULL, idx = NULL)

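You can use lag to get the value from the previous row, e.g. (note that lag looks back only one row, so row 6 below does not receive the desired 2020-04-02 and 8):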
library(dplyr)
dtihave %>%
mutate(iwantdate = ifelse(is.na(value), lag(date), NA) %>% as.Date(., origin = "1970-01-01"),
iwantvalue = ifelse(is.na(value), lag(value), NA))
date type value iwantdate iwantvalue
1: 2020-01-01 1 7 <NA> NA
2: 2020-02-01 1 NA 2020-01-01 7
3: 2020-03-03 NA 6 <NA> NA
4: 2020-04-02 NA 8 <NA> NA
5: 2020-05-09 1 NA 2020-04-02 8
6: 2020-06-10 1 NA 2020-05-09 NA
7: 2020-07-18 NA 5 <NA> NA
8: 2020-08-23 NA 9 <NA> NA
9: 2020-09-09 1 NA 2020-08-23 9

Related

How to merge two columns in R?

newdf=data.frame(id=c(1,3,2),admission=c("2020-05-18","2020-04-30","2020-05-08"),
vent=c("mechanical_vent","self_vent","mechanical_vent"))
newdf$admission=as.Date(newdf$admission)
newdf1=data.frame(id=c(1,3,1,2,1,3,2,2),
date=c("2020-05-19","2020-05-02","2020-05-20","2020-05-09","2020-05-21","2020-05-04","2020-05-10","2020-05-11"),
vent=c("self_vent","mechanical_vent","mechanical_vent","mechanical_vent","self_vent","mechanical_vent","mechanical_vent","self_vent"))
newdf1$date=as.Date(newdf1$date)
newdf=newdf %>% group_by(id) %>% bind_rows(newdf,newdf1)
newdf$dates=paste(newdf$admission,newdf$date)
I want to merge the admission and date columns into a single column of dates. I used the paste function, but the output contains NA values. Could you please suggest a method to solve this?
If you want to transfer the dates from admission to date, where date is NA, this will work:
newdf %>%
mutate(across(c(admission, date), ~ as.character(.))) %>%
mutate(date = ifelse(is.na(date), admission, date))
We could use pmax:
newdf$dates <- pmax(newdf$admission, newdf$date, na.rm = TRUE)
Output:
id admission vent date dates
<dbl> <date> <chr> <date> <date>
1 1 2020-05-18 mechanical_vent NA 2020-05-18
2 3 2020-04-30 self_vent NA 2020-04-30
3 2 2020-05-08 mechanical_vent NA 2020-05-08
4 1 2020-05-18 mechanical_vent NA 2020-05-18
5 3 2020-04-30 self_vent NA 2020-04-30
6 2 2020-05-08 mechanical_vent NA 2020-05-08
7 1 NA self_vent 2020-05-19 2020-05-19
8 3 NA mechanical_vent 2020-05-02 2020-05-02
9 1 NA mechanical_vent 2020-05-20 2020-05-20
10 2 NA mechanical_vent 2020-05-09 2020-05-09
11 1 NA self_vent 2020-05-21 2020-05-21
12 3 NA mechanical_vent 2020-05-04 2020-05-04
13 2 NA mechanical_vent 2020-05-10 2020-05-10
14 2 NA self_vent 2020-05-11 2020-05-11
You can use coalesce -
library(dplyr)
newdf %>% ungroup %>% mutate(dates = coalesce(admission, date))
# id admission vent date dates
# <dbl> <date> <chr> <date> <date>
# 1 1 2020-05-18 mechanical_vent NA 2020-05-18
# 2 3 2020-04-30 self_vent NA 2020-04-30
# 3 2 2020-05-08 mechanical_vent NA 2020-05-08
# 4 1 2020-05-18 mechanical_vent NA 2020-05-18
# 5 3 2020-04-30 self_vent NA 2020-04-30
# 6 2 2020-05-08 mechanical_vent NA 2020-05-08
# 7 1 NA self_vent 2020-05-19 2020-05-19
# 8 3 NA mechanical_vent 2020-05-02 2020-05-02
# 9 1 NA mechanical_vent 2020-05-20 2020-05-20
#10 2 NA mechanical_vent 2020-05-09 2020-05-09
#11 1 NA self_vent 2020-05-21 2020-05-21
#12 3 NA mechanical_vent 2020-05-04 2020-05-04
#13 2 NA mechanical_vent 2020-05-10 2020-05-10
#14 2 NA self_vent 2020-05-11 2020-05-11
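For comparison, here is a minimal base R sketch of why paste leaves NA in the output and how the gaps can be filled without losing the Date class. The four-row vectors are made up for illustration and only mimic the shape of the question's data.

```r
# Illustrative vectors: each row has exactly one of the two dates.
admission <- as.Date(c("2020-05-18", "2020-04-30", NA, NA))
date      <- as.Date(c(NA, NA, "2020-05-19", "2020-05-02"))

# paste() turns each NA into the literal text "NA" before joining.
pasted <- paste(admission, date)

# Start from admission and fill its gaps from date.
# Subset assignment keeps the Date class intact.
dates <- admission
missing_adm <- is.na(dates)
dates[missing_adm] <- date[missing_adm]
dates
```

This is the same logic as coalesce(admission, date), written out by hand.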

Create a column that assigns value to a row in a dataframe based on an event in another row

I have a dataframe that is structured like the following:
example <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2),
event = c("email","email","email","draw","email","email","draw","email","email","email","email","draw"),
date = c("2020-03-01","2020-06-01","2020-07-15","2020-07-28","2020-08-07","2020-09-01","2020-09-15","2020-05-22","2020-06-15","2020-07-13","2020-07-15","2020-07-31"),
amount = c(NA,NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200))
This is a simplified version of the dataframe. I am trying to create a column that will assign a 1 to the last email before the draw event and a column that will have the amount drawn on the same row as the email. The desired dataframe would look like the following:
desiredResult <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2),
event = c("email","email","email","draw","email","email","draw","email","email","email","email","draw"),
date = c("2020-03-01","2020-06-01","2020-07-15","2020-07-28","2020-08-07","2020-09-01","2020-09-15","2020-05-22","2020-06-15","2020-07-13","2020-07-15","2020-07-31"),
amount = c(NA,NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200),
EmailBeforeDrawFlag = c(NA,NA,1,NA,NA,1,NA,NA,NA,NA,1,NA),
EmailBeforeDrawAmount = c(NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200,NA))
Here is the dplyr solution. When you create the new columns, use if_else() in the definition of EmailBeforeDrawFlag to check a condition, and the lead function to look at event in the next row. EmailBeforeDrawAmount is just lead(amount).
example %>%
mutate(EmailBeforeDrawFlag = if_else(lead(event) == "draw", 1, NA_real_ ),
EmailBeforeDrawAmount = lead(amount))
id event date amount EmailBeforeDrawFlag EmailBeforeDrawAmount
1 1 email 2020-03-01 NA NA NA
2 1 email 2020-06-01 NA NA NA
3 1 email 2020-07-15 NA 1 10000
4 1 draw 2020-07-28 10000 NA NA
5 1 email 2020-08-07 NA NA NA
6 1 email 2020-09-01 NA 1 1500
7 1 draw 2020-09-15 1500 NA NA
8 2 email 2020-05-22 NA NA NA
9 2 email 2020-06-15 NA NA NA
10 2 email 2020-07-13 NA NA NA
11 2 email 2020-07-15 NA 1 2200
12 2 draw 2020-07-31 2200 NA NA
We could also use NA^ on the lead to create the column: NA^TRUE is NA and NA^FALSE is 1, so only the rows followed by a draw get a 1.
library(dplyr)
example %>%
mutate(EmailBeforeDrawFlag = NA^(lead(event != 'draw')),
EmailBeforeDrawAmount = lead(amount))
Output:
# id event date amount EmailBeforeDrawFlag EmailBeforeDrawAmount
#1 1 email 2020-03-01 NA NA NA
#2 1 email 2020-06-01 NA NA NA
#3 1 email 2020-07-15 NA 1 10000
#4 1 draw 2020-07-28 10000 NA NA
#5 1 email 2020-08-07 NA NA NA
#6 1 email 2020-09-01 NA 1 1500
#7 1 draw 2020-09-15 1500 NA NA
#8 2 email 2020-05-22 NA NA NA
#9 2 email 2020-06-15 NA NA NA
#10 2 email 2020-07-13 NA NA NA
#11 2 email 2020-07-15 NA 1 2200
#12 2 draw 2020-07-31 2200 NA NA
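A base R sketch of the same idea, using only the first seven rows (id 1) of the example. The lead is written by hand as "drop the first element and append NA", which is an assumption about what lead does by default (n = 1, no grouping).

```r
# First seven rows (id 1) of the example data.
event  <- c("email", "email", "email", "draw", "email", "email", "draw")
amount <- c(NA, NA, NA, 10000, NA, NA, 1500)

# Hand-written lead: the value from the next row, NA for the last row.
next_event  <- c(event[-1], NA)
next_amount <- c(amount[-1], NA)

# NA^TRUE is NA, NA^FALSE is 1, and NA^NA is NA.
flag <- NA^(next_event != "draw")

data.frame(event, amount,
           EmailBeforeDrawFlag = flag,
           EmailBeforeDrawAmount = next_amount)
```

Neither this sketch nor the answers above group by id, so a lead can read across an id boundary. In the example every id ends with a draw, so the result is unaffected.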

R: Populate only next two NA values (grouped by individual) [duplicate]

Is there a way to forward-fill NAs in a zoo or xts object, but only up to a limited number? In other words, fill up to 3 consecutive NAs, then leave the NAs from the 4th onward as they are until the next valid number.
Something like this.
library(zoo)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
x
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 NA NA NA 5 NA NA
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
NA NA NA 11 12 NA NA
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
NA NA NA NA 19 20
The desired output, with a variable n = 3, is:
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 1 1 1 5 5 5
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
5 NA NA 11 12 12 12
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 NA NA NA 19 20
I have tried a lot of combinations with na.locf(x, maxgap = 3) etc. without much success. I can write a loop to get the desired output, but I was wondering whether there is a vectorized way of achieving this.
fillInTheBlanks <- function(v, n=3) {
  result <- v
  counter0 <- 1
  for(i in 2:length(v)) {
    value <- v[i]
    if (is.na(value)) {
      if (counter0 > n) {
        result[i] <- v[i]
      } else {
        result[i] <- result[i-1]
        counter0 <- counter0 + 1
      }
    } else {
      result[i] <- v[i]
      counter0 <- 1
    }
  }
  return(result)
}
Thanks
Here's another way:
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
edit: my previous answer required that x have no duplicates. The current answer does not.
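The one-liner above can be unpacked step by step. The following is a sketch on a plain integer vector rather than a zoo object; the function name fill_limited and its n argument are hypothetical, introduced here only to replace the hard-coded 4 (that is, n + 1).

```r
# Hypothetical helper: forward-fill at most n NAs after each non-NA value.
fill_limited <- function(x, n = 3) {
  x <- as.vector(x)
  # Run id: increases by one at every non-NA value (0 for leading NAs).
  grp <- cumsum(!is.na(x))
  # Position inside each run: 1 for the value itself, 2.. for trailing NAs.
  pos <- ave(grp, grp, FUN = seq_along)
  # Lookup table: slot 1 is NA, slot k + 1 holds the k-th non-NA value.
  lookup <- c(NA, x[!is.na(x)])
  # Positions beyond n + 1 are redirected to slot 1, i.e. NA.
  lookup[replace(grp, pos > n + 1, 0) + 1]
}

x <- 1:20
x[c(2:4, 6:10, 13:18)] <- NA
fill_limited(x)
```

Leading NAs fall into run 0, which maps to the NA slot of the lookup table, so they stay NA.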
Benchmarks:
library(microbenchmark)
x <- rep(x, length.out=1e4)
plourde <- function(x) {
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
function(sx){
if(length(sx)>3)
sx[2:4] <- rep(sx[1],3)
else sx <- rep(sx[1],length(sx))
sx
}))
}
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100
And another idea that, unless I've missed something, seems valid:
na_locf_until = function(x, n = 3)
{
wnn = which(!is.na(x))
inds = sort(c(wnn, (wnn + n+1)[which((wnn + n+1) < c(wnn[-1], length(x)))]))
c(rep(NA, wnn[1] - 1),
as.vector(x)[rep(inds, c(diff(inds), length(x) - inds[length(inds)] + 1))])
}
na_locf_until(x)
#[1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
This does not use na.locf. The idea is to split the series into groups that each start at a non-missing value, then within each group replace only the first 3 values after the non-missing one with that value. It is still a loop, but because it runs once per group rather than once per element, it should be faster than a simple loop over all the values.
zz <-
unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
function(sx){
if(length(sx)>3)
sx[2:4] <- rep(sx[1],3)
else sx <- rep(sx[1],length(sx))
sx
}))
## rebuild the zoo object, since the algorithm above operates only on the values
zoo(zz,index(x))
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26 2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02
1 1 1 1 5 5 5 5 NA NA 11 12 12
2014-10-03 2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 12 NA NA NA 19 20
The cleanest way to implement this in data.table is probably using the join syntax:
na.omit(dt)[dt, on = .(date), roll = +3, .(date, x_filled = x, x = i.x)]
date x_filled x
1: 2019-02-14 1 1
2: 2019-02-15 1 NA
3: 2019-02-16 1 NA
4: 2019-02-17 1 NA
5: 2019-02-18 5 5
6: 2019-02-19 5 NA
7: 2019-02-20 5 NA
8: 2019-02-21 5 NA
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 12 NA
14: 2019-02-27 12 NA
15: 2019-02-28 12 NA
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
*This solution depends on the date column, and on the dates being contiguous.
From playing around in data.table comes this hacky solution:
np1 <- 3 + 1
dt[,
x_filled := x[c(rep(1, min(np1, .N)), rep(NA, max(0, .N - np1)))],
by = cumsum(!is.na(x))]
# Or slightly simplified:
dt[,
x_filled := ifelse(rowid(x) < 4, x[1], x[NA]),
by = cumsum(!is.na(x))]
> dt
date x x_filled
1: 2019-02-14 1 1
2: 2019-02-15 NA 1
3: 2019-02-16 NA 1
4: 2019-02-17 NA 1
5: 2019-02-18 5 5
6: 2019-02-19 NA 5
7: 2019-02-20 NA 5
8: 2019-02-21 NA 5
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 NA 12
14: 2019-02-27 NA 12
15: 2019-02-28 NA 12
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
We build on the fact that subsetting vectors with NA returns NA.
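A small base R sketch of that fact, applied to a single run (one value followed by NAs) the way the first data.table expression above applies it per group. The vector grp and the variable n are illustrative only.

```r
n <- 3
grp <- c(5, NA, NA, NA, NA, NA)   # one run: a value followed by five NAs

# Index 1 for the value and the next n rows, NA for whatever is left.
idx <- c(rep(1, min(n + 1, length(grp))),
         rep(NA, max(0, length(grp) - (n + 1))))

# An index of 1 repeats the first element; an NA index returns NA.
grp[idx]
```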
Data/Packages
library(zoo)
library(data.table)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
dt <- data.table(date = index(x), x = as.integer(x))

R: na.locf not behaving as expected

I am trying to use the na.locf function inside a mutate and I am getting a strange result. The data is sorted by date in descending order; where a column is NA it should take the result of na.locf, and otherwise keep the value in the column. For most rows the result is as expected, but one row comes back with the next non-NA value rather than the previous one. If I sort the data by date ascending and use na.rm = F and fromLast = T it works as expected, but I want to understand why it does not work when the data is sorted by date descending.
The example is as follows:
example = data.frame(Date = factor(c("1/14/15", "1/29/15", "2/3/15",
"2/11/15", "2/15/15", "3/4/15","3/7/15", "3/7/15", "3/11/15",
"3/18/15", "3/21/15", "4/22/15", "4/22/15", "4/23/15", "5/6/15",
"5/13/15", "5/18/15", "5/24/15", "5/26/15", "5/28/15", "5/29/15",
"5/29/15", "6/25/15", "6/25/15","8/6/15", "8/15/15", "8/20/15",
"8/22/15", "8/22/15", "8/29/15")),
Scan = c(1, rep(NA, 21),2,rep(NA,7)),
Hours = c(rep(NA,3), rep(3,3), NA, 2, rep(3,3), NA, 2, 3, 2,
rep(3,5), NA, 2, rep(c(NA, 3),2), 3, NA, 2, 3)
)
library(dplyr)
library(tidyr)
library(zoo)
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan),
Scan))
The issue in the result is in row 24, where scan_new comes back as 1 rather than 2:
Date Scan Hours date scan_date scan_new
23 3/7/15 NA 0 2015-03-07 <NA> 2
24 3/7/15 NA 2 2015-03-07 <NA> 1
25 3/4/15 NA 3 2015-03-04 <NA> 2
Interestingly, other rows that share a date are handled correctly, for example rows 18-19:
Date Scan Hours date scan_date scan_new
18 4/22/15 NA 0 2015-04-22 <NA> 2
19 4/22/15 NA 2 2015-04-22 <NA> 2
For reference as noted above, the following provides the expected answer:
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = F, fromLast = T),
Scan))
Date Scan Hours date scan_date scan_new
6 3/4/15 NA 3 2015-03-04 <NA> 2
7 3/7/15 NA 0 2015-03-07 <NA> 2
8 3/7/15 NA 2 2015-03-07 <NA> 2
Can someone tell me why this is behaving this way?
In your first attempt, na.locf(Scan) removes the leading NAs (the default is na.rm = TRUE), so it returns a shorter vector, and ifelse then recycles that vector to the full length. For reference, here are the results with na.rm = FALSE (or na.locf0, see comments):
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = FALSE),
Scan))
# Date Scan Hours date scan_date scan_new
# 1 8/29/15 NA 3 2015-08-29 <NA> NA
# 2 8/22/15 NA 0 2015-08-22 <NA> NA
# 3 8/22/15 NA 2 2015-08-22 <NA> NA
# 4 8/20/15 NA 3 2015-08-20 <NA> NA
# 5 8/15/15 NA 3 2015-08-15 <NA> NA
# 6 8/6/15 NA 0 2015-08-06 <NA> NA
# 7 6/25/15 2 0 2015-06-25 2015-06-25 2
# 8 6/25/15 NA 3 2015-06-25 <NA> 2
# 9 5/29/15 NA 0 2015-05-29 <NA> 2
# 10 5/29/15 NA 2 2015-05-29 <NA> 2
# 11 5/28/15 NA 3 2015-05-28 <NA> 2
# 12 5/26/15 NA 3 2015-05-26 <NA> 2
# 13 5/24/15 NA 3 2015-05-24 <NA> 2
# 14 5/18/15 NA 3 2015-05-18 <NA> 2
# 15 5/13/15 NA 3 2015-05-13 <NA> 2
# 16 5/6/15 NA 2 2015-05-06 <NA> 2
# 17 4/23/15 NA 3 2015-04-23 <NA> 2
# 18 4/22/15 NA 0 2015-04-22 <NA> 2
# 19 4/22/15 NA 2 2015-04-22 <NA> 2
# 20 3/21/15 NA 3 2015-03-21 <NA> 2
# 21 3/18/15 NA 3 2015-03-18 <NA> 2
# 22 3/11/15 NA 3 2015-03-11 <NA> 2
# 23 3/7/15 NA 0 2015-03-07 <NA> 2
# 24 3/7/15 NA 2 2015-03-07 <NA> 2
# 25 3/4/15 NA 3 2015-03-04 <NA> 2
# 26 2/15/15 NA 3 2015-02-15 <NA> 2
# 27 2/11/15 NA 3 2015-02-11 <NA> 2
# 28 2/3/15 NA 0 2015-02-03 <NA> 2
# 29 1/29/15 NA 0 2015-01-29 <NA> 2
# 30 1/14/15 1 0 2015-01-14 2015-01-14 1
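The recycling can be reproduced without zoo. The sketch below uses a hypothetical base R helper, locf_keep, as a stand-in for na.locf(x, na.rm = FALSE), and then drops the leading NAs by hand to mimic the default na.rm = TRUE. The six-element vector is a shortened, made-up version of the Scan column in descending date order.

```r
# Hypothetical stand-in for na.locf(v, na.rm = FALSE): carry the last
# non-NA value forward and keep leading NAs.
locf_keep <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))
  c(NA, v)[idx + 1]
}

scan <- c(NA, NA, 2, NA, NA, 1)

full    <- locf_keep(scan)       # NA NA 2 2 2 1  (same length as scan)
trimmed <- full[!is.na(full)]    # 2 2 2 1        (leading NAs dropped)

# ifelse() silently recycles the shorter vector to 2 2 2 1 2 2,
# so every filled value is shifted by the number of dropped NAs.
bad  <- ifelse(is.na(scan), trimmed, scan)
good <- ifelse(is.na(scan), full, scan)

rbind(scan, bad, good)
```

In bad, position 4 picks up the trailing 1 and the leading NAs are filled from the recycled start of the vector, which matches the pattern seen in rows 24 and 25 of the question.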

