So, I have the following problem:
I have a data set, A (data.table object), of the following structure:
date days rate
1996-01-02 9 5.763067
1996-01-02 15 5.745902
1996-01-02 50 5.673317
1996-01-02 78 5.608884
1996-01-02 169 5.473762
1996-01-03 9 5.763067
1996-01-03 14 5.747397
1996-01-03 49 5.672263
1996-01-03 77 5.603705
1996-01-03 168 5.470584
1996-01-04 11 5.729460
1996-01-04 13 5.726104
1996-01-04 48 5.664931
1996-01-04 76 5.601891
1996-01-04 167 5.468961
Note that the days column and its size may differ for each day.
My goal is now to (piecewise linearly) interpolate rate along days. I am doing this for each day via
approx(x=A[,days],y=A[,rate],xout=days_vec,rule=2)
where days_vec <- min_days:max_days, i.e. the days range I am interested in (say 1:100).
I have two problems here:
approx only interpolates, i.e. it does not extrapolate linearly beyond min(x) and max(x); with rule=2 it just repeats the value at the nearest end point. If I am interested in days 1:100, I first need to extrapolate by hand using days 9 and 15 (the first 2 lines of A) via:
first_days <- 1:(A[1,days]-1) #1:8
rate_vec[first_days] <- A[1,rate] +
(first_days - A[1,days])/(A[2,days]-A[1,days])*(A[2,rate]-A[1,rate])
and then using the approx line above for rate_vec[9:100]. Is there a way of doing this in 1 step?
Right now, given that I need two steps and the shift point between the two procedures (here 9) differs among dates, I cannot see an implementation via data.table, although this would be vastly preferred (using data.table methods to interpolate/extrapolate and then returning the expanded data.table object). Thus, I currently run a for loop through the dates, which is of course much, much slower.
Question: Is there a better way to implement this, and can it be done with data.table methods instead of looping through A?
How about something like this.
# please try to make a fully reproducible example!
library(data.table)
df <- fread(input=
"date days rate
1996-01-02 9 5.763067
1996-01-02 15 5.745902
1996-01-02 50 5.673317
1996-01-02 78 5.608884
1996-01-02 169 5.473762
1996-01-03 9 5.763067
1996-01-03 14 5.747397
1996-01-03 49 5.672263
1996-01-03 77 5.603705
1996-01-03 168 5.470584
1996-01-04 11 5.729460
1996-01-04 13 5.726104
1996-01-04 48 5.664931
1996-01-04 76 5.601891
1996-01-04 167 5.468961")
df[,date := as.Date(date)]
1. Create NA values of rate for days in range 1:100 that aren't in dataset
df <-
merge(df,
expand.grid( days=1L:100L, # whatever range you are interested in
date=df[,sort(unique(date))] ), # dates with at least one observation
all=TRUE # "outer join" on all common columns (date, days)
)
2. For each value of date, use a linear model to predict NA values of rate.
df[, rate := ifelse(is.na(rate),
predict(lm(rate~days,.SD),.SD), # impute NA w/ lm using available data
rate), # if not NA, don't impute
keyby=date]
Gives you:
head(df,10)
# date days rate
# 1: 1996-01-02 1 5.766787 <- rates for days 1-8 & 10 are imputed
# 2: 1996-01-02 2 5.764987
# 3: 1996-01-02 3 5.763186
# 4: 1996-01-02 4 5.761385
# 5: 1996-01-02 5 5.759585
# 6: 1996-01-02 6 5.757784
# 7: 1996-01-02 7 5.755983
# 8: 1996-01-02 8 5.754183
# 9: 1996-01-02 9 5.763067 <- this rate was given
# 10: 1996-01-02 10 5.750581
If there are values of date without at least two observations of rate, you will probably get an error because you won't have enough points to fit a line.
Alternative: Rolling joins solution for piecewise linear interpolation
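This requires rolling joins to the left and to the right, and an average of the two that ignores NA values. Note that this average is the midpoint of the two neighbouring observations, not a distance-weighted interpolation (see day 10 in the output below).
This doesn't do well for extrapolation, though, since the result is just a constant (either the first or the last observation) outside the range of observed days.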
setkey(df, date, days)
df2 <- data.table( # this is your framework of date/days pairs you want to evaluate
expand.grid( date=df[,sort(unique(date))],
days=1L:100L),
key = c('date','days')
)
# average of non-NA values between two vectors
meanIfNotNA <- function(x,y){
(ifelse(is.na(x),0,x) + ifelse(is.na(y),0,y)) /
( as.numeric(!is.na(x)) + as.numeric(!is.na(y)))
}
df3 <- # this is your evaluations for the date/days pairs in df2.
setnames(
df[setnames( df[df2, roll=+Inf], # rolling join Last Obs Carried Fwd (LOCF)
old = 'rate',
new = 'rate_locf'
),
roll=-Inf], # rolling join Next Obs Carried Backwd (NOCB)
old = 'rate',
new = 'rate_nocb'
)[, rate := meanIfNotNA(rate_locf,rate_nocb)]
# once you're satisfied that this works, you can include rate_locf := NULL, etc.
head(df3,10)
# date days rate_nocb rate_locf rate
# 1: 1996-01-02 1 5.763067 NA 5.763067
# 2: 1996-01-02 2 5.763067 NA 5.763067
# 3: 1996-01-02 3 5.763067 NA 5.763067
# 4: 1996-01-02 4 5.763067 NA 5.763067
# 5: 1996-01-02 5 5.763067 NA 5.763067
# 6: 1996-01-02 6 5.763067 NA 5.763067
# 7: 1996-01-02 7 5.763067 NA 5.763067
# 8: 1996-01-02 8 5.763067 NA 5.763067
# 9: 1996-01-02 9 5.763067 5.763067 5.763067 <- this rate was given
# 10: 1996-01-02 10 5.745902 5.763067 5.754485
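The question asks for linear interpolation inside the observed range and linear extrapolation outside it, in one step and per date. The following is a minimal sketch of one way to do that, not the method of either answer above. It assumes every date has at least two observations with distinct days; interp_extrap is a hypothetical helper (not part of base R or data.table), and the table is a shortened copy of A.

```r
library(data.table)

# Shortened copy of the question's data (two dates, three points each)
A <- data.table(
  date = as.Date(rep(c("1996-01-02", "1996-01-03"), each = 3)),
  days = c(9L, 15L, 50L, 9L, 14L, 49L),
  rate = c(5.763067, 5.745902, 5.673317, 5.763067, 5.747397, 5.672263)
)

# Hypothetical helper: approx() inside range(x), straight-line
# continuation of the first/last segment outside it.
# Assumes x is sorted increasingly and has at least two points.
interp_extrap <- function(x, y, xout) {
  n   <- length(x)
  out <- approx(x, y, xout = xout, rule = 1)$y   # NA outside range(x)
  lo  <- xout < x[1]
  hi  <- xout > x[n]
  out[lo] <- y[1] + (xout[lo] - x[1]) * (y[2] - y[1]) / (x[2] - x[1])
  out[hi] <- y[n] + (xout[hi] - x[n]) * (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  out
}

days_vec <- 1:60   # range of interest, chosen for illustration

# One call per date; the result is the expanded table
res <- A[order(date, days),
         .(days = days_vec, rate = interp_extrap(days, rate, days_vec)),
         by = date]
```

For days 1 to 8 of the first date this reproduces the hand-written formula from the question, so no separate extrapolation step or loop over dates is needed.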
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
library(dplyr)      # for %>% and mutate()
library(lubridate)  # for day<-, used in mon_seq() below
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
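The same month-counting idea can be sketched in base R without dplyr or lubridate. This is only an illustration: first_of_month is a hypothetical helper, and the end dates below are made up (extended from the question's sample) so that the episodes span more than one month.

```r
# Start dates from the question; end dates are invented for illustration
Start.Date <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
End.Date   <- as.Date(c("2014-01-04", "2014-03-03", "2014-02-03", "2014-01-04"))

# Hypothetical helper: move a date to the first day of its month
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# For each client, list every year-month touched by the episode
months_active <- lapply(seq_along(Start.Date), function(i) {
  format(seq(first_of_month(Start.Date[i]),
             first_of_month(End.Date[i]),
             by = "month"),
         "%Y-%m")
})

# Count how many clients were active in each year-month
counts <- table(unlist(months_active))
counts
```

Stepping from the first of one month to the first of the next avoids the varying month lengths that make a fixed by = n days awkward.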
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for the times that occur in the interval which spans midnight (20:00 - 05:59). These times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Rows 4-8 & 10-11 fall in the 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, together with its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those. Since you're interested in the number of observations per interval, you can treat this one as the complement of all the others: any observation that matches no other interval belongs to it.
For example, first convert your data.frames to data.tables:
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Note that the times that return 'NA' are actually those that fall inside the midnight-spanning interval.
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins >= TimeMins) ## >= so that a time equal to Till is still matched
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations per interval, count the rows with .N grouped by From and Till, which can simply be chained onto the end of the above statement:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins >= TimeMins) ## >= so that a time equal to Till is still matched
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
where the NA group corresponds to the interval that spans midnight.
I just tweaked your code to get the desired result. Hope this helps!
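adply(interval.data, 1, function(x, y)
# From > Till means the interval wraps past midnight: count both parts
# (the bounds must be quoted strings; a bare 23:59 is the integer sequence 23, 24, ..., 59)
if (x[[1]] > x[[2]]) sum(y[[1]] %between% c(x[[1]], "23:59"), y[[1]] %between% c("00:00", x[[2]]))
else sum(y[[1]] %between% c(x[[1]], x[[2]])), y = observations)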
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
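The wrap-around logic used by both answers can also be written out explicitly in base R. The sketch below is an illustration only: to_minutes and in_interval are hypothetical helpers, and the data are the sample values from the question.

```r
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"),
                            Till = c("19:59", "05:59", "13:59"),
                            stringsAsFactors = FALSE)
observations <- data.frame(
  Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00",
           "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"),
  stringsAsFactors = FALSE)

# Hypothetical helper: "HH:MM" -> minutes since midnight
to_minutes <- function(hm) {
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
         integer(1))
}

# Hypothetical helper: TRUE where t lies in [from, till];
# if from > till the interval is taken to wrap past midnight
in_interval <- function(t, from, till) {
  if (from <= till) t >= from & t <= till else t >= from | t <= till
}

obs  <- to_minutes(observations$Time)
from <- to_minutes(interval.data$From)
till <- to_minutes(interval.data$Till)

# One count per interval
interval.data$N <- mapply(function(f, tl) sum(in_interval(obs, f, tl)),
                          from, till)
interval.data
```

An interval that wraps past midnight is the union of [From, 24:00) and [00:00, Till], which is why the comparison switches from "and" to "or".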
I am new to time series and was hoping someone could provide some input/ideas here.
I am trying to find ways to impute missing values.
I was hoping to find the moving average, but most of the packages (smooth, mgcv, etc.) don't seem to take time intervals into consideration.
For example, the dataset might look like something below and I would want value at 2016-01-10 to have the greatest influence in calculating the missing value:
Date Value Diff_Days
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-01-30 50 16
I have instances where NA might be the first observation or the last observation. Sometimes NA values also occur multiple times, at which point the rolling window would need to expand, and this is why I would like to use the moving average.
Is there a package that would take date intervals / separate weights into consideration?
Or please suggest if there is a better way to impute NA values in such cases.
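You can use glm or any other model. Note that glm with its default (gaussian) family fits a single straight line through all observed points, so it is a global trend rather than a weighted moving average.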
Input
con <- textConnection("Date Value Diff_Days
2015-12-14 NA 0
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-02-14 NA 0
2016-02-18 NA 0
2016-02-29 50 16")
df <- read.table(con, header = T)
df$Date <- as.Date(df$Date)
df$Date.numeric <- as.numeric(df$Date)
fit <- glm(Value ~ Date.numeric, data = df)
df.na <- df[is.na(df$Value),]
predicted <- predict(fit, df.na)
df$Value[is.na(df$Value)] <- predicted
plot(df$Date, df$Value)
points(df.na$Date, predicted, type = "p", col="red")
df$Date.numeric <- NULL
rm(df.na)
print(df)
Output
Date Value Diff_Days
1 2015-12-14 -3.054184 0
2 2016-01-01 10.000000 13
3 2016-01-10 14.000000 4
4 2016-01-14 18.518983 0
5 2016-01-28 30.000000 14
6 2016-02-14 40.092149 0
7 2016-02-18 42.875783 0
8 2016-02-29 50.000000 16
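Since the question asks for the nearest observation in time to carry the most weight, linear interpolation on the date axis may be closer to what is wanted than a global fit. A minimal sketch using base R's approx, assuming at least two non-missing values; it is an alternative to the glm approach above, not a reproduction of it, and uses the five rows from the question.

```r
df <- data.frame(
  Date  = as.Date(c("2016-01-01", "2016-01-10", "2016-01-14",
                    "2016-01-28", "2016-01-30")),
  Value = c(10, 14, NA, 30, 50)
)

obs <- !is.na(df$Value)

# Interpolate on the numeric date axis, so the gap in days sets the weights;
# rule = 2 repeats the nearest observed value for leading/trailing NAs
df$Value_filled <- approx(x    = as.numeric(df$Date[obs]),
                          y    = df$Value[obs],
                          xout = as.numeric(df$Date),
                          rule = 2)$y
df
```

Here 2016-01-14 lies 4 days after 2016-01-10 and 14 days before 2016-01-28, so the filled value is 14 + 16 * 4/18, much closer to 14 than to 30.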
I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are included in ns and vice versa.
For each value of data$id I am trying to find whether there is an ns$date within the 14 days before the data$date where data$id==ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE depending on whether ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates
library(lubridate)  # provides ymd()
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: this automatically renames data$date to date.x and ns$date to date.y
# create variable to say when there is a date difference of 0 to 13 days
# (>= 0 rather than > 0, so that a same-day match such as id 87 in the desired output counts)
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>=0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
The linked question "Find if date is within 14 days if id matches between datasets in R" provides an idea, but its result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested I added what the desired output should look like (same number of rows of data = 10 - no rows for data in ns not in data). Should have thought this might help earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
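For comparison, the same rule (an ns date from 0 to 13 days before the data date, keeping the smallest gap) can be sketched in base R. This is an illustration on a small subset of the question's data, not a replacement for the data.table answer; gaps and within are illustrative names.

```r
# Small subset of the question's data, with dates already as Date
data <- data.frame(
  id   = c(8, 8, 50, 141),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18", "2010-11-29"))
)
ns <- data.frame(
  id   = c(8, 50, 50, 50),
  date = as.Date(c("2011-02-25", "2010-08-17", "2010-08-15", "2011-05-13"))
)

# For each row of data: day gaps to every ns date with the same id
gaps <- lapply(seq_len(nrow(data)), function(i) {
  as.numeric(data$date[i]) - as.numeric(ns$date[ns$id == data$id[i]])
})

# Keep only gaps of 0 to 13 days (ns date on or shortly before data date)
within <- lapply(gaps, function(g) g[g >= 0 & g < 14])

# NA when the id never appears in ns, otherwise TRUE/FALSE
data$received <- ifelse(lengths(gaps) > 0, lengths(within) > 0, NA)

# Smallest qualifying gap, NA when there is none
data$dif <- vapply(within,
                   function(g) if (length(g)) min(g) else NA_real_,
                   numeric(1))
data
```

On this subset the result matches the corresponding rows of the desired output: 6 days for the first row of id 8, FALSE for the second, 1 day for id 50, and NA for id 141.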
I am loading a data.table from CSV file that has date, orders, amount etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount etc.), or carry the last value forward (e.g. 03-Jan will reuse the 02-Jan values and 06-Jan will reuse the 05-Jan values).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE, and expand.grid for missing weekdays - it may work for weekdays (since they are just 7 weekdays) - but not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
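The rolling join covers the carry-forward case. For the other option mentioned in the question, filling gaps with a default of zero, a plain join against the full date sequence followed by an update is one possibility. A minimal sketch on a shortened copy of the data; the names sales, all_dates and filled are illustrative.

```r
library(data.table)

# Shortened copy of the question's table (2013-01-03 is missing)
sales <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04")),
  orders = c(50L, 3L, 1L),
  amount = c(2272.55, 64.04, 18.81)
)

# Every date between the first and last observation
all_dates <- data.table(date = seq(min(sales$date), max(sales$date), by = "day"))

# Ordinary (non-rolling) join: missing dates get NA in the value columns
filled <- sales[all_dates, on = "date"]

# Replace the NAs introduced by the join with the default value
filled[is.na(orders), `:=`(orders = 0L, amount = 0)]
filled
```

This treats a missing orders value as the marker for an inserted date, which is only safe when the original table contains no NAs of its own.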
Here is how you fill in the gaps within subgroup
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
library(zoo)  # na.locf() comes from zoo (xts loads it as a dependency)
na.locf(dt)