Creating a time series from a dataset including missing values - r

I need to create a time series from a data frame. The problem is that the observations are not regularly ordered in time. The data frame looks like this:
Cases Date
15 1/2009
30 3/2010
45 12/2013
I have 60 observations like that. As you can see, the data were collected irregularly, starting from 1/2008 and ending 12/2013 (most of the months between these years have no recorded cases). My assumption will be that there were no cases in those months. So, how can I convert this dataset into a time series? Then I will try to predict the possible number of cases in the future.

Try installing the plyr library,
install.packages("plyr")
and then sum the rows that share the same Date2 value (this assumes dat has a Date2 column holding the parsed dates):
library(plyr)
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
  data.frame(Cases = sum(x$Cases))
})
> head(mergedData)
Date2 Cases
1 2008-01-01 16352
2 2008-11-01 10
3 2009-01-01 23
4 2009-02-01 138
5 2009-04-01 18
6 2009-06-01 3534

You can create a separate, complete sequence of dates and merge it with your data. This creates a complete time series in which the missing months are NA.
If df is your data frame and its Date column is of class Date, create a data frame of all months (called all_months here to avoid masking the ts() function) and merge as below.
all_months <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by="1 month"))
dfwithmissing <- merge(all_months, df, by="Date", all=TRUE)
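The question's dates are "month/year" strings and the asker wants missing months treated as zero cases, so a few extra steps are needed before and after the merge. Below is a minimal base R sketch under those assumptions. The three rows are the sample from the question, and the names all_months, full and cases_ts are illustrative only.

```r
# Stand-in for the question's data: Cases plus "month/year" strings
df <- data.frame(
  Cases = c(15, 30, 45),
  Date  = c("1/2009", "3/2010", "12/2013"),
  stringsAsFactors = FALSE
)

# Parse "m/YYYY" as the first day of that month
df$Date <- as.Date(paste0("1/", df$Date), format = "%d/%m/%Y")

# If a month appears more than once, sum its cases first
df <- aggregate(Cases ~ Date, data = df, FUN = sum)

# One row per month across the whole study period
all_months <- data.frame(
  Date = seq(as.Date("2008-01-01"), as.Date("2013-12-01"), by = "month")
)

# Keep every month; months without data get NA
full <- merge(all_months, df, by = "Date", all.x = TRUE)

# The question assumes unrecorded months had no cases
full$Cases[is.na(full$Cases)] <- 0

# Monthly series starting in January 2008
cases_ts <- ts(full$Cases, start = c(2008, 1), frequency = 12)
```

The resulting cases_ts is a regular monthly series that forecasting functions can take directly. Whether zero is the right fill value depends on whether "no record" really means "no cases".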

Related

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into a time series using ts(), but the problem is that the current data frame has multiple values for the same date. Can we apply time series methods in this case?
Can I convert the data frame into a time series and build a model (ARIMA) that can forecast the count value on a daily basis?
Or should I forecast the count value based on grp? In that case I would have to select only the grp and count columns of the data frame, so I would have to skip the date column. Would a daily forecast of count then be impossible?
Suppose I want to aggregate the count value per day. I tried the aggregate function, but there we have to specify the date value, and I have a very large data set. Is there any other option available in R?
Can somebody please suggest a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
It seems like there are two aspects of your problem:
i want to convert this data frame into time series using ts(), but the
problem is- current data frame having multiple values for the same
date. can we apply time series in this case?
If you are happy to use the xts package, you could try the following (the format string follows the mm-dd-YYYY layout stated in the question):
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")
dtaXTS <- xts::as.xts(dta2[,2:3], dta2$date)
which would result in:
> head(dtaXTS)
count grp
2009-01-09 54 1
2009-01-09 100 2
2009-01-09 546 3
2009-01-10 67 4
2009-01-11 80 5
2009-01-11 45 6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as a univariate time series, referring to the selected variable, or as a multivariate time series. An example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
can somebody plz suggest me what is the better approach to follow, my
assumption is time series forcast is works only for bivariate data? is
this assumption also right?
IMHO, this is rather broad. I would suggest that you use the created xts object and elaborate on the model you want to use and why. If it's a conceptual question about the nature of time series analysis, you may prefer to post your follow-up question on CrossValidated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.
Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.
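The question also asks how to aggregate to daily totals without listing each date by hand. A minimal base R sketch, reusing the sample data and assuming the mm-dd-YYYY format stated in the question (the name daily is illustrative):

```r
# Sample data from the question
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)

# Parse the dates (month-day-year)
DF$date <- as.Date(DF$date, format = "%m-%d-%Y")

# The formula interface groups by every distinct date automatically,
# so no date values need to be spelled out
daily <- aggregate(count ~ date, data = DF, FUN = sum)
daily
```

This gives one row per date with the summed count, which is the univariate daily series an ARIMA model would be fitted to.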

Calculate Running Difference in Dates as New Dataframe Column

I've searched for several days and am still stumped.
Given a dataset defined by the following:
ids = c("a","b","c")
dates = c(as.Date("2015-01-01"), as.Date("2015-02-01"), as.Date("2015-02-15"))
test = data.frame(ids, dates)
I am trying to dynamically add new columns to the data frame whose values will be the difference between the column date (2015-03-01) and the value in the date column. I would expect the result would look like the following, but with a better column name:
d20150301 = c(59, 28, 14)
result = data.frame(ids, dates, d20150301)
Many thanks in advance.
You can subtract a vector of dates from a single date, so
test$d2015_03_01 <- as.Date('2015-03-01')-test$dates
makes test look like
> test
ids dates d2015_03_01
1 a 2015-01-01 59 days
2 b 2015-02-01 28 days
3 c 2015-02-15 14 days
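Since the question asks for columns to be added dynamically, the same subtraction can be run in a loop over several reference dates. This is only a sketch: the vector ref_dates and the second date in it are hypothetical, and the column-name pattern is one possible choice.

```r
ids <- c("a", "b", "c")
dates <- as.Date(c("2015-01-01", "2015-02-01", "2015-02-15"))
test <- data.frame(ids, dates)

# Hypothetical set of reference dates, one new column per date
ref_dates <- as.Date(c("2015-03-01", "2015-04-01"))

# Loop over positions: iterating over the dates directly would drop the Date class
for (i in seq_along(ref_dates)) {
  d <- ref_dates[i]
  col_name <- format(d, "d%Y_%m_%d")            # e.g. "d2015_03_01"
  test[[col_name]] <- as.numeric(d - test$dates) # difference in days as plain numbers
}
test
```

Wrapping the difference in as.numeric() stores plain day counts rather than difftime values, which matches the numeric result requested in the question.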

Best way to input daily data into R to allow further manipulation

I have daily rainfall data in Excel (which I can save as a CSV or txt file) that I would like to manipulate and load into R. I'm very new to R.
The data are laid out with the following columns:
Year; Month; Rain on day 1 of Month; Rain on day 2; ...; Rain on day 31
This means that I have a large array/table. Some data is missing because it wasn't recorded, and some because February 31st, June 31st, etc do not exist.
I would like to analyse things like monthly totals, and their distributions.
What is the best way to input data so it can be easily manipulated, and that I can distinguish between missing data and NULL data (31st Feb)?
Thanks a lot in advance
There are several things for you to have a look at, e.g. readxl::read_excel() for reading Excel files, or Hmisc::monthDays(dates) for determining the number of days in each month of a dates vector.
Anyway, here's one idea as a starter:
# create sample data
set.seed(1)
mat <- matrix(rbinom(5*31, 31, .5), nrow=5)
mat[sample(1:length(mat), 10)] <- NA
df <- data.frame(year=2016, month=1:5, mat)
# reshape data from wide to long format
library(reshape2)
dflong <- melt(df, id.vars = 1:2, variable.name = "day")
# add date column (will be NA if conversion is not possible, i.e. if the date does not exist)
dflong$date <- as.Date(with(dflong, paste(year, month, day, sep="-")), format = "%Y-%m-X%e")
# Select only existing dates
dflong <- subset(dflong[order(dflong$month), ], !is.na(date))
# Aggregate: means per month and year (missing values removed)
aggregate(value~year+month, dflong, mean, na.rm=TRUE)
# year month value
# 1 2016 1 15.93548
# 2 2016 2 15.26923
# 3 2016 3 15.10345
# 4 2016 4 15.74074
# 5 2016 5 16.16667
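The question asks for monthly totals and for a way to tell unrecorded readings apart from dates that do not exist. A base R sketch of that idea follows, using a made-up table in which every reading is 1 so the totals are easy to check. The names wide, long, totals and n_missing are illustrative.

```r
# Made-up wide table: one row per month, one column per day of the month
wide <- data.frame(year = 2016, month = 1:3, matrix(1, nrow = 3, ncol = 31))
names(wide)[3:33] <- paste0("d", 1:31)
wide$d5[1] <- NA  # a reading that was genuinely not recorded (5 Jan 2016)

# Wide to long: unlist() walks the day columns one after another
long <- data.frame(
  year  = rep(wide$year, times = 31),
  month = rep(wide$month, times = 31),
  day   = rep(1:31, each = nrow(wide)),
  rain  = unlist(wide[3:33], use.names = FALSE)
)

# Impossible dates such as 30 February become NA here
long$date <- as.Date(
  sprintf("%d-%02d-%02d", long$year, long$month, long$day),
  format = "%Y-%m-%d"
)

# Drop the dates that do not exist; NA in rain now always means "not recorded"
long <- long[!is.na(long$date), ]

# Monthly totals (rows with NA rain are dropped by the formula interface)
totals <- aggregate(rain ~ year + month, data = long, FUN = sum)

# Number of unrecorded days per month
n_missing <- tapply(is.na(long$rain), long$month, sum)
```

After the filtering step the two kinds of gap are separated: nonexistent dates are gone from the table, while unrecorded days remain as rows with NA rain and can be counted or imputed.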

Add months of zero demand to zoo time series

I have some intermittent demand data that only includes lines where demand is present. I bring it in via read.csv, and my 2 columns are Date (as date) and Quantity (as integer). Then I convert it to a zoo series and combine the daily demand into monthly demand. My final output is a zoo series with the date being the first day of the month and the summed demand for that month.
My problem is that this zoo series is missing the in between months that have zero demand and I need these to forecast intermittent demand correctly.
For example: I have quantity 2 in date 2013-01-01 and then the next line is quantity 3 in 2013-10-01. I need to add quantity zero to 2013-02-01 through 2013-09-01.
library(zoo)
Date <- c('1/1/2013','10/1/2013','11/1/2013')
Quantity <- c(2, 3, 6)  # numeric, so the series can be summed and zero-filled
Date <- as.Date(Date, "%m/%d/%Y")
df <- data.frame(Date, Quantity)
df <- read.zoo(df)
df
The zoo series output:
2013-01-01 2013-10-01 2013-11-01
2 3 6
Because "df" is a zoo object, you may use merge.zoo and its fill argument. The current data set is merged with an empty zoo object which contains all the desired dates.
tt <- seq(min(Date), max(Date), "month")
merge(df, zoo(, tt), fill = 0)
# 2013-01-01 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 2013-09-01 2013-10-01 2013-11-01
# 2 0 0 0 0 0 0 0 0 3 6
For further examples, see ?merge.zoo ("extend an irregular series to a regular one").
You can use merge to add the missing rows and then set their values to zero.
First, let's create some fake data:
# Vector of dates from Jan 1, 2015, to Mar 31, 2015
dates = seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by="1 day")
# Let's create data for few of these dates, leaving some out
set.seed(55)
dat = data.frame(dates=dates[sample(1:length(dates), 70)],
                 quantity=sample(1:10, 70, replace=TRUE))
dat = dat[order(dat$dates),]
Now let's make believe dat is what you imported from a csv file. We want to fill in quantity=0 for the missing dates. So first we need to add rows for the missing dates. You can do this by creating a date vector containing all dates from the first date to the last date in your csv file and using the merge function. In this case, we've already created that date vector above.
Now merge in rows for the missing dates. The new rows will have NA for quantity. We'll change those NAs to zero below.
dat = merge(data.frame(dates), dat, by="dates", all.x=TRUE)
# Set missing values to zero
dat$quantity[is.na(dat$quantity)] = 0
Now you can aggregate by month, convert to a zoo series, etc.
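As a sketch of that last step, the zero-filled daily data can be summed by month and turned into a regular monthly series in base R. The data below are made up (one unit per day, so the monthly sums equal the number of days), and the names monthly and monthly_ts are illustrative.

```r
# Made-up daily data that has already been zero-filled
dat <- data.frame(
  dates    = seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day"),
  quantity = 1
)

# Label every row with the first day of its month
dat$month <- as.Date(format(dat$dates, "%Y-%m-01"))

# Sum the daily quantities within each month
monthly <- aggregate(quantity ~ month, data = dat, FUN = sum)

# Regular monthly series; the start is hard-coded to match the sample data
monthly_ts <- ts(monthly$quantity, start = c(2015, 1), frequency = 12)
```

Because the missing days were filled before aggregating, any month with no demand at all still appears in monthly with a total of zero.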

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
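eps <- .Machine$double.eps
library(dplyr)
# Note: subtracting eps from a Date has no effect at this magnitude, so
# between() still includes the end date, and the sums below are end-inclusive.
# For an end-exclusive sum, use daily$date >= start & daily$date < end instead.
intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps)]))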
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (mainly due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
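Another option that avoids both row-wise evaluation and extra packages is to use prefix sums. The sketch below assumes that daily has exactly one row per day in date order with no gaps, and that every start and end date occurs in daily$date (otherwise match() returns NA). It computes end-exclusive sums, as asked in the question. The small data set is made up so the result is easy to verify.

```r
# Made-up stand-in for the question's data
daily <- data.frame(
  date  = seq(as.Date("2008-03-01"), by = "day", length.out = 10),
  value = 1:10
)
intervals <- data.frame(
  start = as.Date(c("2008-03-01", "2008-03-02", "2008-03-03")),
  end   = as.Date(c("2008-03-06", "2008-03-09", "2008-03-10"))
)

# prefix[k + 1] holds the sum of the first k daily values
prefix <- c(0, cumsum(daily$value))

# Row positions of each start and end date in the daily table
i <- match(intervals$start, daily$date)
j <- match(intervals$end, daily$date)

# Sum over [start, end), i.e. rows i .. j - 1
intervals$nhdd <- prefix[j] - prefix[i]
intervals
```

Here the first interval gives sum(daily$value[1:5]) and the second gives sum(daily$value[2:8]), matching the examples in the question. Everything is vectorised, so the cost is one cumsum over daily plus two lookups per interval.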

Resources