R data.table replace missing value by next non missing value - r

I have a data.table whose age column contains missing values; rdate is in Date format. For each horsenum, I want to fill a missing age by finding the next non-missing age and its rdate, then computing: next non-missing age minus the ceiling of the year difference between that non-missing rdate and the current record's rdate. I assume the next non-missing rdate is the birthday, which is why I use the ceiling of the year difference. I also want to keep rdate.fill in Date format. How can I write this in data.table code?
This is my idea for calculating age.fill, but it throws an error:
library(lubridate)
data[, rdate.fill := ifelse(is.na(age), as.Date(rdate[na.lacf(age)]), NA), by=horsenum]
data[, age.fill := ifelse(is.na(age), ind4- ceiling(time_length(difftime(rdate.fill, rdate, "years"), age), by=horsenum]
input
index rdate horsenum age ind4
1: 14704 2009-03-01 K123 NA 10
2: 14767 2009-03-01 K212 NA 9
3: 39281 2011-10-09 K123 NA 10
4: 39561 2011-10-19 K212 NA 9
5: 74560 2015-04-07 K212 NA 9
6: 77972 2015-09-06 K123 10 NA
7: 79111 2015-10-10 K212 9 NA
8: 84233 2016-03-28 K212 10 NA
structure(list(index = c(14704L, 14767L, 39281L, 39561L, 74560L,
77972L, 79111L, 84233L), rdate = structure(c(14304, 14304, 15256,
15266, 16532, 16684, 16718, 16888), class = "Date"), horsenum = c("K123",
"K212", "K123", "K212", "K212", "K123", "K212", "K212"), age = c(NA,
NA, NA, NA, NA, 10, 9, 10), ind4 = c(10, 9, 10, 9, 9, NA, NA,
NA)), row.names = c(NA, -8L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000002c5512f1ef0>)
output
index rdate horsenum age ind4 rdate.fill age.fill
1: 14704 2009-03-01 K123 NA 10 2015-09-06 3
2: 14767 2009-03-01 K212 NA 9 2015-10-10 2
3: 39281 2011-10-09 K123 NA 10 2015-09-06 6
4: 39561 2011-10-19 K212 NA 9 2015-10-10 5
5: 74560 2015-04-07 K212 NA 9 2015-10-10 8
6: 77972 2015-09-06 K123 10 NA 10
7: 79111 2015-10-10 K212 9 NA 9
8: 84233 2016-03-28 K212 10 NA 10

Not clear to me how age.fill is calculated differently for rows 2 and 4 as compared to rows 1 and 3 respectively. But this should get you closer to your needs:
library(data.table) #data.table_1.12.4
DT[, rdate.fill := nafill(fifelse(is.na(age), as.Date(NA), rdate), "nocb"), horsenum][,
age.fill := fifelse(is.na(age), ind4 - ceiling(lubridate::time_length(difftime(rdate.fill, rdate), "years")), age), horsenum]
output:
index rdate horsenum age ind4 rdate.fill age.fill
1: 14704 2009-03-01 K123 NA 10 2015-09-06 3
2: 14767 2009-03-01 K212 NA 9 2015-10-10 2
3: 39281 2011-10-09 K123 NA 10 2015-09-06 6
4: 39561 2011-10-19 K212 NA 9 2015-10-10 5
5: 74560 2015-04-07 K212 NA 9 2015-10-10 8
6: 77972 2015-09-06 K123 10 NA 2015-09-06 10
7: 79111 2015-10-10 K212 9 NA 2015-10-10 9
8: 84233 2016-03-28 K212 10 NA 2016-03-28 10
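For readers unfamiliar with the "nocb" (next observation carried backward) fill used above, here is a minimal base-R sketch of the idea on a plain numeric vector. The helper name nocb_fill is made up for illustration and is not part of data.table; nafill(x, "nocb") is the function the answer actually relies on.

```r
# Hypothetical helper (not from data.table): fill each NA with the next
# non-missing value that follows it; trailing NAs stay NA.
nocb_fill <- function(x) {
  idx <- which(!is.na(x))                      # positions of observed values
  # for position i: count observed positions before i, then step to the next one
  nxt <- idx[findInterval(seq_along(x) - 1, idx) + 1]
  x[nxt]                                       # indexing with NA yields NA
}

age <- c(NA, NA, 10, NA, 9, NA)
nocb_fill(age)
# [1] 10 10 10  9  9 NA
```

In the answer the same fill is applied to rdate within each horsenum, so every row with a missing age picks up the date of the next row where age is known.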

data[,age.fill := nafill(age,'nocb'),by=horsenum][,
rdate.fill:=ifelse(is.na(age),rdate[which.min(age.fill==age)],rdate),by=horsenum][,
age.fill:=unclass(age.fill - round((rdate.fill-rdate)/365))
]
index rdate horsenum age ind4 age.fill rdate.fill
1: 14704 2009-03-01 K123 NA 10 3 2015-09-06
2: 14767 2009-03-01 K212 NA 9 2 2015-10-10
3: 39281 2011-10-09 K123 NA 10 6 2015-09-06
4: 39561 2011-10-19 K212 NA 9 5 2015-10-10
5: 74560 2015-04-07 K212 NA 9 8 2015-10-10
6: 77972 2015-09-06 K123 10 NA 10 2015-09-06
7: 79111 2015-10-10 K212 9 NA 9 2015-10-10
8: 84233 2016-03-28 K212 10 NA 10 2016-03-28

Your algorithm systematically underestimates age. For example, horse K212's estimated age on 2015-04-07 (row 5) is 8. However, we know K212's age on 2016-03-28 is 10 (row 8), so K212 must be 9 on 2015-04-07, not 8. Here I address this problem by calculating an estimated birthdate from each non-NA rdate, then calculating the earliest estimated birthdate for each horse.
library(data.table)
data=data.table(index=c(14704L,14767L,39281L,39561L,74560L,77972L,79111L,84233L),rdate=structure(c(14304,14304,15256,15266,16532,16684,16718,16888),class="Date"),horsenum=c("K123","K212","K123","K212","K212","K123","K212","K212"),age=c(NA,NA,NA,NA,NA,10,9,10))
lt = data[!is.na(age),as.POSIXlt(rdate)]
lt$year = lt$year - data[!is.na(age),age]
data[!is.na(age),bday:=as.Date(lt)]
data[,bday:=min(bday,na.rm=T),horsenum]
data[,age.fill:=floor(as.numeric(rdate-bday)/365)]
data[order(index)]
Output:
index rdate horsenum age bday age.fill
1: 14704 2009-03-01 K123 NA 2005-09-06 3
2: 14767 2009-03-01 K212 NA 2006-03-28 2
3: 39281 2011-10-09 K123 NA 2005-09-06 6
4: 39561 2011-10-19 K212 NA 2006-03-28 5
5: 74560 2015-04-07 K212 NA 2006-03-28 9
6: 77972 2015-09-06 K123 10 2005-09-06 10
7: 79111 2015-10-10 K212 9 2006-03-28 9
8: 84233 2016-03-28 K212 10 2006-03-28 10
Note: this algorithm could be improved. Consider K212 is 9 on 2015-10-10 and 10 on 2016-03-28. This means that K212's actual birthday is after 10-10 and before 3-28. Instead of assuming 3-28, we could assume it is halfway between 10-10 and 3-28, or, more specifically, if there is more than one estimated birthdate, calculate both the max and the min possible birthdate for each horse, then find the date that's halfway between max - 1 year and min.
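The improvement described in the note can be sketched in base R for a single horse. This is only an illustration under the assumption that a recorded age a on rdate means the birthday lies after rdate minus (a + 1) years and on or before rdate minus a years; the helper shift_years and the variable names are hypothetical, not part of the answer's code.

```r
# Hypothetical helper: move dates back by n calendar years
shift_years <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year - n
  as.Date(lt)
}

# K212's two known ages from the sample data
obs <- data.frame(rdate = as.Date(c("2015-10-10", "2016-03-28")),
                  age   = c(9, 10))

latest   <- shift_years(obs$rdate, obs$age)          # latest possible birthday per row
earliest <- shift_years(obs$rdate, obs$age + 1) + 1  # earliest possible birthday per row

lo <- max(earliest)   # "2005-10-11"
hi <- min(latest)     # "2006-03-28"

# assumed estimate: the midpoint of the window that satisfies both rows
bday_mid <- lo + floor(as.numeric(hi - lo) / 2)
```

With only one known age the window is a full year wide, so the midpoint is a much rougher guess than it is for K212.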

The approach below is slightly different:
It calculates, based on the given ages, the possible range of the horse's birthday. It then uses this window to calculate the minimum and maximum age a horse can have on the given rdate.
So, the more information you have about the horse's age, the smaller the window of possible birthdays, and the bigger the chance that the minimum estimated age equals the maximum estimated age (if they are the same, you know the horse's age for sure).
Here we go:
library( data.table )
library( lubridate ) #for the %m+% and %m-% operators
library( intervals ) #to calculate with intervals and find overlaps
library( eeptools ) #for age_calc function; calculating the age, given a date and a birthday (respects leap years, etc.)
#read sample data
DT <- fread("
index rdate horsenum age ind4
14704 2009-03-01 K123 NA 10
14767 2009-03-01 K212 NA 9
39281 2011-10-09 K123 NA 10
39561 2011-10-19 K212 NA 9
74560 2015-04-07 K212 NA 9
77972 2015-09-06 K123 10 NA
79111 2015-10-10 K212 9 NA
84233 2016-03-28 K212 10 NA")
#convert dates to POSIXct
DT[, rdate := as.POSIXct(rdate) ]
#set keys
setkey( DT, horsenum, rdate, age )
#calculate bandwidth date of birth (dob) based on age and date
DT[!is.na( age ), dob_min := as.integer( rdate %m-% lubridate::years(age + 1) %m+% lubridate::days(1) ) ]
DT[!is.na( age ), dob_max := as.integer( rdate %m-% lubridate::years(age) ) ]
#function to get the overlap of birthday intervals
myfun <- function( y ) {
all_intervals <- intervals::Intervals( as.matrix( y ), check_valid = TRUE )
int_min <- all_intervals[1]
#intersect the running result with each interval in turn
for (i in 1:nrow(all_intervals) ) int_min <- interval_intersection( int_min, all_intervals[i] )
as.data.table( int_min )
}
#get range of possible date of birth for each horsenum
dob_range <- DT[ !is.na(age), myfun( .SD ), by = .(horsenum), .SDcols = c("dob_min", "dob_max") ]
dob_range <- dob_range[, .(horsenum, dob_from = as.POSIXct(V1, origin = "1970-01-01"),
dob_to = as.POSIXct(V2, origin = "1970-01-01"))]
#use found ranges of birthday to estimate ages
#first join dob-ranges by horsenum
DT[ dob_range, `:=`( dob_from = i.dob_from, dob_to = i.dob_to), on = .(horsenum)]
#now calculate ages (minimum and maximum)
DT[, age_min := floor( eeptools::age_calc( as.Date(dob_to), as.Date(rdate), units= "years" ) )]
DT[, age_max := floor( eeptools::age_calc( as.Date(dob_from), as.Date(rdate), units= "years" ) )]
#remove helper columns
DT[, `:=`( dob_min = NULL, dob_max = NULL, dob_from = NULL, dob_to = NULL)]
# index rdate horsenum age ind4 age_min age_max
# 1: 14704 2009-03-01 K123 NA 10 3 4
# 2: 39281 2011-10-09 K123 NA 10 6 7
# 3: 77972 2015-09-06 K123 10 NA 10 10
# 4: 14767 2009-03-01 K212 NA 9 2 3
# 5: 39561 2011-10-19 K212 NA 9 5 6
# 6: 74560 2015-04-07 K212 NA 9 9 9
# 7: 79111 2015-10-10 K212 9 NA 9 9
# 8: 84233 2016-03-28 K212 10 NA 10 10

Related

Calculate maximum date interval - R

The challenge is a data.frame with one group variable (id) and two date variables (start and stop). The date intervals are irregular, and I'm trying to calculate the uninterrupted interval in days starting from the first start date per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would be a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap between rows 6 and 7 and to take this point as the end of the maximum interval (34 days). An interval from 2018-10-01 to 2018-10-01 would be counted as 1.
My usual lubridate approaches don't work with this example (interval %within% lag(interval)).
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop){
ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
list(start = min(start[ind]),
stop = max(stop[ind]))
}
newdata <-
data[, first_int(start, stop), by = id] %>%
.[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days
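The logic behind the answer can also be traced in base R for a single id. The sketch below is a variant, not the answer's exact code: it compares each start with the furthest stop seen so far (cummax) rather than only the previous row's stop. The function name first_run_days is made up for illustration.

```r
# Hypothetical helper: length in days of the first uninterrupted run of
# intervals, counting both end days (so start == stop gives 1).
first_run_days <- function(start, stop) {
  o <- order(start)
  s <- as.numeric(start[o])
  e <- as.numeric(stop[o])
  reach <- cummax(e)                                # furthest day covered so far
  # TRUE where a row starts after everything before it has ended
  gap <- c(FALSE, s[-1] > reach[-length(reach)])
  in_first <- cumsum(gap) == 0                      # rows before the first gap
  max(e[in_first]) - s[1] + 1
}

# id 3 from the example data
start <- as.Date(c("2015-04-10", "2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14"))
stop  <- as.Date(c("2015-04-13", "2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15"))
first_run_days(start, stop)
# [1] 34
```

As in the answer, a row that starts the day after the previous stop counts as a gap because the comparison is strict; whether adjacent days should be merged instead is a choice the question leaves open.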

Match dates from list of data frames in R

I have a list of 100+ time series dataframes my.list with daily observations for each product in its own data frame. Some values are NA without any record of the date. I would like to update each data frame in this list to show the date and NA if it does not have a record on this date.
Dates:
start = as.Date('2016/04/08')
full <- seq(start, by='1 days', length=10)
Sample Time Series Data:
d1 <- data.frame(Date = seq(start, by ='2 days',length=5), Sales = c(5,10,15,20,25))
d2 <- data.frame(Date = seq(start, by= '1 day', length=10),Sales = c(1, 2, 3,4,5,6,7,8,9,10))
my.list <- list(d1, d2)
I want to merge all full date values into each data frame, and if no match exists then sales is NA:
my.list
[[d1]]
Date Sales
2016-04-08 5
2016-04-09 NA
2016-04-10 10
2016-04-11 NA
2016-04-12 15
2016-04-13 NA
2016-04-14 20
2016-04-15 NA
2016-04-16 25
2016-04-17 NA
[[d2]]
Date Sales
2016-04-08 1
2016-04-09 2
2016-04-10 3
2016-04-11 4
2016-04-12 5
2016-04-13 6
2016-04-14 7
2016-04-15 8
2016-04-16 9
2016-04-17 10
If I understand correctly, the OP wants to update each of the dataframes in my.list to contain one row for each date given in the vector of dates full.
Base R
In base R, merge() can be used as already mentioned by Hack-R. However, the answer below expands this to work on all dataframes in the list:
# create dataframe from vector of full dates
full.df <- data.frame(Date = full)
# apply merge on each dataframe in the list
lapply(my.list, merge, y = full.df, all.y = TRUE)
[[1]]
Date Sales
1 2016-04-08 5
2 2016-04-09 NA
3 2016-04-10 10
4 2016-04-11 NA
5 2016-04-12 15
6 2016-04-13 NA
7 2016-04-14 20
8 2016-04-15 NA
9 2016-04-16 25
10 2016-04-17 NA
[[2]]
Date Sales
1 2016-04-08 1
2 2016-04-09 2
3 2016-04-10 3
4 2016-04-11 4
5 2016-04-12 5
6 2016-04-13 6
7 2016-04-14 7
8 2016-04-15 8
9 2016-04-16 9
10 2016-04-17 10
Caveat
The answer assumes that full covers the overall range of Date of all dataframes in the list.
In order to avoid any mishaps, the overall range of Date can be retrieved from the available data in my.list:
overall_date_range <- Reduce(range, lapply(my.list, function(x) range(x$Date)))
full <- seq(overall_date_range[1], overall_date_range[2], by = "1 days")
Using rbindlist()
Alternatively, the list of dataframes, which are identical in structure, can be stored in one large dataframe. An additional attribute indicates which product each row belongs to. The homogeneous structure simplifies subsequent operations.
The code below uses the rbindlist() function from the data.table package to create a large data.table. CJ() (cross join) creates all combinations of dates and product id which is then merged / joined to fill in the missing dates:
library(data.table)
all_products <- rbindlist(my.list, idcol = "product.id")[
CJ(product.id = unique(product.id), Date = seq(min(Date), max(Date), by = "1 day")),
on = .(Date, product.id)]
all_products
product.id Date Sales
1: 1 2016-04-08 5
2: 1 2016-04-09 NA
3: 1 2016-04-10 10
4: 1 2016-04-11 NA
5: 1 2016-04-12 15
6: 1 2016-04-13 NA
7: 1 2016-04-14 20
8: 1 2016-04-15 NA
9: 1 2016-04-16 25
10: 1 2016-04-17 NA
11: 2 2016-04-08 1
12: 2 2016-04-09 2
13: 2 2016-04-10 3
14: 2 2016-04-11 4
15: 2 2016-04-12 5
16: 2 2016-04-13 6
17: 2 2016-04-14 7
18: 2 2016-04-15 8
19: 2 2016-04-16 9
20: 2 2016-04-17 10
Subsequent operations can be grouped by product.id, e.g., to determine the number of valid sales data for each product:
all_products[!is.na(Sales), .(valid.sales.data = .N), by = product.id]
product.id valid.sales.data
1: 1 5
2: 2 10
Or, the total sales per product:
all_products[, .(total.sales = sum(Sales, na.rm = TRUE)), by = product.id]
product.id total.sales
1: 1 75
2: 2 55
If required for some reason the result can be converted back to a list by
split(all_products, by = "product.id")

Merge multiple date columns into one

I have a data frame that contains several columns with dates
col1<-seq( as.Date("2011-07-01"), by=20, len=10)
col2<-seq( as.Date("2011-09-01"), by=7, len=10)
col3<-seq( as.Date("2011-08-01"), by=1, len=10)
data.frame(col1,col2,col3)
The data frame looks like this:
col1 col2 col3
1 2011-07-01 2011-09-01 2011-08-01
2 2011-07-21 2011-09-08 2011-08-02
3 2011-08-10 2011-09-15 2011-08-03
4 2011-08-30 2011-09-22 2011-08-04
5 2011-09-19 2011-09-29 2011-08-05
6 2011-10-09 2011-10-06 2011-08-06
7 2011-10-29 2011-10-13 2011-08-07
8 2011-11-18 2011-10-20 2011-08-08
9 2011-12-08 2011-10-27 2011-08-09
10 2011-12-28 2011-11-03 2011-08-10
I am trying to merge them into one column so that
A. Only the lowest (earliest) date remains per row and others get ignored
1 2011-07-01
2 2011-07-21
3 2011-08-03
4 2011-08-04
5 2011-08-05
6 2011-08-06
7 2011-08-07
8 2011-08-08
9 2011-08-09
10 2011-08-10
B. Only the highest (latest) date remains per row
1 2011-09-01
2 2011-09-08
3 2011-09-15
4 2011-09-22
5 2011-09-29
6 2011-10-09
7 2011-10-29
8 2011-11-18
9 2011-12-08
10 2011-12-28
The real dataset has NAs so if NA gets encountered it should be ignored unless all columns have a missing value for a particular row, in which case NA will be generated there as well.
Any thoughts?
pmin and pmax are helpful here:
do.call(pmin, dat)
# [1] "2011-07-01" "2011-07-21" "2011-08-03" "2011-08-04" "2011-08-05"
# [6] "2011-08-06" "2011-08-07" "2011-08-08" "2011-08-09" "2011-08-10"
do.call(pmax, dat)
# [1] "2011-09-01" "2011-09-08" "2011-09-15" "2011-09-22" "2011-09-29"
# [6] "2011-10-09" "2011-10-29" "2011-11-18" "2011-12-08" "2011-12-28"
This also works for NA values, like:
do.call(pmin, c(dat, na.rm=TRUE) )
You can also select the specific columns you want to analyse like:
do.call(pmin, c(dat[c("col1","col2","col3")], na.rm=TRUE) )
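To make the NA handling concrete, here is a small self-contained sketch with a hypothetical three-row data frame (the values are chosen for illustration and are not the asker's real data). It includes a row where every column is NA, which is the case the question asks about.

```r
# Hypothetical data: row 2 has one NA, row 3 is NA in every column
dat <- data.frame(
  col1 = as.Date(c("2011-07-01", NA, NA)),
  col2 = as.Date(c("2011-09-01", "2011-09-08", NA)),
  col3 = as.Date(c("2011-08-01", "2011-08-02", NA))
)

# na.rm = TRUE ignores NAs within a row; an all-NA row still yields NA
earliest <- do.call(pmin, c(dat, na.rm = TRUE))
latest   <- do.call(pmax, c(dat, na.rm = TRUE))

earliest
# [1] "2011-07-01" "2011-08-02" NA
latest
# [1] "2011-09-01" "2011-09-08" NA
```

Both results keep the Date class, so no conversion back from numbers is needed.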
We can use max.col to find the index of the maximum values in each row, then cbind with row index and get the value per each row, convert to a data.frame
j1 <- sapply(df1, as.numeric)
df2 <- data.frame(Date = df1[cbind(1:nrow(df1),max.col(j1, 'first') )])
df3 <- data.frame(Date = df1[cbind(1:nrow(df1), max.col(-1*j1, "first"))])
df2
# Date
#1 2011-09-01
#2 2011-09-08
#3 2011-09-15
#4 2011-09-22
#5 2011-09-29
#6 2011-10-09
#7 2011-10-29
#8 2011-11-18
#9 2011-12-08
#10 2011-12-28
df3
# Date
#1 2011-07-01
#2 2011-07-21
#3 2011-08-03
#4 2011-08-04
#5 2011-08-05
#6 2011-08-06
#7 2011-08-07
#8 2011-08-08
#9 2011-08-09
#10 2011-08-10
Or another option is
as.Date(apply(df1, 1, min, na.rm = TRUE))
as.Date(apply(df1, 1, max, na.rm = TRUE))
Or with tidyverse
library(tidyverse)
df1 %>%
rowwise() %>%
transmute(col1Max = max(col1, col2, col3), colMin = min(col1, col2, col3))

Fill NA in a time series only to a limited number

Is there a way we can fill NAs in a zoo or xts object with limited number of NAs forward. In other words like fill NAs up to 3 consecutive NAs, and then keep the NAs from the 4th value on until a valid number.
Something like this.
library(zoo)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
x
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 NA NA NA 5 NA NA
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
NA NA NA 11 12 NA NA
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
NA NA NA NA 19 20
The desired output, with n = 3, would be:
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 1 1 1 5 5 5
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
5 NA NA 11 12 12 12
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 NA NA NA 19 20
I have tried many combinations with na.locf(x, maxgap = 3) etc. without much success. I can write a loop to get the desired output, but I was wondering whether there is a vectorized way of achieving this.
fillInTheBlanks <- function(v, n=3) {
  result <- v
  counter0 <- 1
  for(i in 2:length(v)) {
    value <- v[i]
    if (is.na(value)) {
      if (counter0 > n) {
        result[i] <- v[i]
      } else {
        result[i] <- result[i-1]
        counter0 <- counter0 + 1
      }
    } else {
      result[i] <- v[i]
      counter0 <- 1
    }
  }
  return(result)
}
Thanks
Here's another way:
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
edit: my previous answer required that x have no duplicates. The current answer does not.
benchmarks
x <- rep(x, length.out=1e4)
plourde <- function(x) {
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
function(sx){
if(length(sx)>3)
sx[2:4] <- rep(sx[1],3)
else sx <- rep(sx[1],length(sx))
sx
}))
}
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100
And another idea that, unless I've missed something, seems valid:
na_locf_until = function(x, n = 3)
{
wnn = which(!is.na(x))
inds = sort(c(wnn, (wnn + n+1)[which((wnn + n+1) < c(wnn[-1], length(x)))]))
c(rep(NA, wnn[1] - 1),
as.vector(x)[rep(inds, c(diff(inds), length(x) - inds[length(inds)] + 1))])
}
na_locf_until(x)
#[1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
This does not use na.locf. The idea is to split your xts into groups that each start at a non-missing value, then within each group replace only the first 3 values after the non-missing one with that value. It is a loop, but since it runs once per group, it should be faster than a simple loop over all the values.
zz <-
unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
function(sx){
if(length(sx)>3)
sx[2:4] <- rep(sx[1],3)
else sx <- rep(sx[1],length(sx))
sx
}))
## recreate the zoo object, since the algorithm above is applied only to the values
zoo(zz,index(x))
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26 2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02
1 1 1 1 5 5 5 5 NA NA 11 12 12
2014-10-03 2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 12 NA NA NA 19 20
The cleanest way to implement this in data.table is probably using the join syntax:
na.omit(dt)[dt, on = .(date), roll = +3, .(date, x_filled = x, x = i.x)]
date x_filled x
1: 2019-02-14 1 1
2: 2019-02-15 1 NA
3: 2019-02-16 1 NA
4: 2019-02-17 1 NA
5: 2019-02-18 5 5
6: 2019-02-19 5 NA
7: 2019-02-20 5 NA
8: 2019-02-21 5 NA
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 12 NA
14: 2019-02-27 12 NA
15: 2019-02-28 12 NA
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
*This solution depends on the date column being contiguous.
From playing around in data.table comes this hacky solution:
np1 <- 3 + 1
dt[,
x_filled := x[c(rep(1, min(np1, .N)), rep(NA, max(0, .N - np1)))],
by = cumsum(!is.na(x))]
# Or slightly simplified:
dt[,
x_filled := ifelse(rowid(x) < 4, x[1], x[NA]),
by = cumsum(!is.na(x))]
> dt
date x x_filled
1: 2019-02-14 1 1
2: 2019-02-15 NA 1
3: 2019-02-16 NA 1
4: 2019-02-17 NA 1
5: 2019-02-18 5 5
6: 2019-02-19 NA 5
7: 2019-02-20 NA 5
8: 2019-02-21 NA 5
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 NA 12
14: 2019-02-27 NA 12
15: 2019-02-28 NA 12
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
We build on the fact that subsetting vectors with NA returns NA.
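The same NA-subsetting trick can be sketched without data.table or zoo. The following base-R function is an illustration only; the name fill_limited and its n argument are assumptions made for this example, not something from the answers above.

```r
# Hypothetical helper: carry the last observed value forward over at most
# n consecutive NAs; anything beyond that stays NA.
fill_limited <- function(x, n = 3) {
  grp <- cumsum(!is.na(x))                         # run id, bumps at each observed value
  pos <- ave(seq_along(x), grp, FUN = seq_along)   # position inside the run (1 = observed value)
  src <- c(NA, which(!is.na(x)))[grp + 1]          # index of the value each element would copy
  src[pos > n + 1] <- NA                           # too far from the observation: index with NA
  x[src]                                           # subsetting with NA returns NA
}

x <- 1:20
x[c(2:4, 6:10, 13:18)] <- NA
fill_limited(x, n = 3)
# [1]  1  1  1  1  5  5  5  5 NA NA 11 12 12 12 12 NA NA NA 19 20
```

Leading NAs (before the first observed value) fall into run 0 and stay NA, because their source index is already NA.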
Data/Packages
library(zoo)
library(data.table)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
dt <- data.table(date = index(x), x = as.integer(x))
