Add months of zero demand to zoo time series - r

I have some intermittent demand data that only includes lines where demand is present. I bring it in via read.csv, and my 2 columns are Date (as date) and Quantity (as integer). Then I convert it to a zoo series and combine the daily demand into monthly demand. My final output is a zoo series with the date being the first day of the month and the summed demand for that month.
My problem is that this zoo series is missing the in between months that have zero demand and I need these to forecast intermittent demand correctly.
For example: I have quantity 2 in date 2013-01-01 and then the next line is quantity 3 in 2013-10-01. I need to add quantity zero to 2013-02-01 through 2013-09-01.
Date <- c('1/1/2013','10/1/2013','11/1/2013')
Quantity <- c('2','3','6')
Date <- as.Date(Date, "%m/%d/%Y")
df <- data.frame(Date, Quantity)
df <- read.zoo(df)
df
The zoo series output:
2013-01-01 2013-10-01 2013-11-01
2 3 6

Because "df" is a zoo object, you may use merge.zoo and its fill argument. The current data set is merged with an empty zoo object which contains all the desired dates.
tt <- seq(min(Date), max(Date), "month")
merge(df, zoo(, tt), fill = 0)
# 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 2013-09-01 2013-10-01 2013-11-01
# 2 0 0 0 0 0 0 0 0 3 6
For further examples, see ?merge.zoo ("extend an irregular series to a regular one").

You can use merge to add the missing rows and then set their values to zero.
First, let's create some fake data:
# Vector of dates from Jan 1, 2015, to Mar 31, 2015
dates = seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by="1 day")
# Let's create data for few of these dates, leaving some out
set.seed(55)
dat = data.frame(dates=dates[sample(1:length(dates), 70)],
quantity=sample(1:10, 70, replace=TRUE))
dat = dat[order(dat$dates),]
Now let's make believe dat is what you imported from a csv file. We want to fill in quantity=0 for the missing dates. So first we need to add rows for the missing dates. You can do this by creating a date vector containing all dates from the first date to the last date in your csv file and using the merge function. In this case, we've already created that date vector above.
Now merge in rows for the missing dates. The new rows will have NA for quantity. We'll change those NAs to zero below.
dat = merge(data.frame(dates), dat, by="dates", all.x=TRUE)
# Set missing values to zero
dat$quantity[is.na(dat$quantity)] = 0
Now you can aggregate by month, convert to a zoo series, etc.

Related

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). Example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
df$date <- as.Date(gsub("99","01",as.character(df$date)), format="%d/%m/%Y")
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
# id date dateorig
#1 1 2014-10-14 99/10/2014
#2 2 2011-02-05 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-19 99/04/2009

Best way to input daily data into R to allow further manipulation

I have daily rainfall data in Excel (which I can save as a CSV or txt file) that I would like to manipulate and load into R. I'm very new to R.
The format of the data is such that I have I have the following columns
Year; Month; Rain on day 1 of Month, Rain on Day 2, ... , Rain on day 31;
This means that I have a large array/table. Some data is missing because it wasn't recorded, and some because February 31st, June 31st, etc do not exist.
I would like to analyse things like monthly totals, and their distributions.
What is the best way to input data so it can be easily manipulated, and that I can distinguish between missing data and NULL data (31st Feb)?
Thanks a lot in advance
Several things for you to have a look at. E.g. readxl::read_excel() for reading excel files or Hmisc::monthDays(dates) for determining the number of days for each month in a dates vector.
Anyway, here's one idea as a starter:
# create sample data
set.seed(1)
mat <- matrix(rbinom(5*31, 31, .5), nrow=5)
mat[sample(1:length(mat), 10)] <- NA
df <- data.frame(year=2016, month=1:5, mat)
# reshape data from wide to long format
library(reshape2)
dflong <- melt(df, id.vars = 1:2, variable.name = "day")
# add date column (will be NA if conversion is not possible, i.e. if date does not exists)
dflong$date <- as.Date(with(dflong, paste(year, month, day, sep="-")), format = "%Y-%m-X%e")
# Select only existing dates
dflong <- subset(dflong[order(dflong$month), ], !is.na(date))
# Aggregate: means per month and year (missing values removed)
aggregate(value~year+month, dflong, mean, na.rm=TRUE)
# year month value
# 1 2016 1 15.93548
# 2 2016 2 15.26923
# 3 2016 3 15.10345
# 4 2016 4 15.74074
# 5 2016 5 16.16667

R turn irregular time interval into regular ones using previous numbers

i have an irregular time interval like this
df=data.frame(Date=c("2013-01-08","2013-01-11","2013-01-13","2013-01-21","2013-02-06"), runningtotal=c(800,910,1060,1210,660)
i found through zoo object it can be merged with a regular time interval and fill in 0 as missing values. However, I need to fill in previous value instead, except at month start fill it with 0. So the end output is like this:
date runningtotal
2013-01-01 0
2013-01-02 0
...
2013-01-08 800
2013-01-09 800
2013-01-10 800
2013-01-11 910
2013-01-12 910
2013-01-13 1060
...
2013-02-01 0
And also, does it make sense to fill in value like this for forecasting purpose?
Thanks.
Try approxfun with the constant method. I don't have lubridate and just deal with regular Date objects. For instance:
df<-data.frame(Date=c("2013-01-08","2013-01-11","2013-01-13","2013-01-21","2013-02-06"), runningtotal=c(800,910,1060,1210,660))
df$Date<-as.Date(as.character(df$Date))
#create some new dates
newDates<-seq(df$Date[1],df$Date[5],length.out=10)
intfun<-approxfun(df$Date,df$runningtotal,method="constant",yleft=0,yright=0)
data.frame(newDates,intfun(newDates))
I would use na.locf from zoo package. But You should prepare data before applying it.
## generate a vector of dates
mm <- min(DF$Date)
day(mm) <- 1
seq_dates <- seq.POSIXt(mm,max(DF$Date),by='days')
## add zeros valus for the beging of month
DF <- rbind(DF,data.frame(Date=seq_dates[day(seq_dates)==1],runningtotal=0))
library(zoo)
## merge with the sequence of dates , and apply na.locf for previous values.
na.locf(merge(seq_dates,DF,by=1,all.x=TRUE))
The idea is to apply na.locf that change missing values with the previous non missing values. Merge your data with a sequence of dates(from the first month to the end of dates) will insert missing values.

Creating with time series from a dataset including missing values

I need to create a time series from a data frame. The problem is variables is not well-ordered. Data frame is like below
Cases Date
15 1/2009
30 3/2010
45 12/2013
I have 60 observations like that. As you can see, data was collected randomly, which is starting from 1/2008 and ending 12/2013 ( There are many missing values(cases) in bulk of the months between these years). My assumption will be there is no cases in that months. So, how can I convert this dataset as time series? Then, I will try to make some prediction for possible number of cases in future.
Try installing the plyr library,
install.packages("plyr")
and then to sum duplicated Date2 rows:
library(plyr)
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
data.frame(Cases = sum(x$Cases))
})
> head(mergedData)
Date2 Cases
1 2008-01-01 16352
2 2008-11-01 10
3 2009-01-01 23
4 2009-02-01 138
5 2009-04-01 18
6 2009-06-01 3534
you can create a separate sequence of time series and merge with data series.This will create a complete time series with missing values as NA.
if df is your data frame with Date as column of date than create new time series ts and merge as below.
ts <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by="1 month"))
dfwithmisisng <- merge(ts, df, by="Date", all=T)

r - time series padding with NA

Say, if I have a data frame as follows:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
x1 = rnorm(length(Date1)))
where the spacing between each measurement is 1 hour. How would it be possible to pad this data frame with NAs for the rest of the year, where the final solution should have a length of 8760 i.e. hourly measurements for the entire year. I would like to have the DateTime column to extent from 2010-01-01 00:00 to 2010-12-31 23:00, for example, but have the x1 column to be NA for the days that have been added to the original data frame (if that makes sense). I would like to come up with a solution where there can be any number of years i.e. if the data extends from May 2009 to September 2012 then the final solution should have this data set but with the missing times i.e. from January 2009 to December 2012 to be padded with NA's. How can I go about solving this issue?
Create new data frame that contains all hours and then merge both data frames.
df2<-data.frame(DateTime=seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-31 23:00"), by = "hour"))
merge(df2,Dat,all=TRUE)

Resources