I have a series of daily sales amounts from 1/1/2018 to 10/15/2018; an example is shown below. I have already observed some monthly cycling patterns in the sales amount: there is always a sales peak at the end of each month and slight fluctuation in the middle of the month. Also, sales in June, July and August are generally higher than in the other months. Now I need to predict the sales amount for the 10 days after 10/15/2018. I'm new to time series and ARIMA, and have two questions:
1. How do I create such a daily time series and plot it against the date?
2. How can I set the cycle (or frequency) to capture the monthly cycling pattern?
Date SalesAmount
1/1/2018 31,380.31
1/2/2018 384,418.10
1/3/2018 1,268,633.28
1/4/2018 1,197,742.76
1/5/2018 417,143.36
1/6/2018 693,172.65
1/8/2018 840,384.76
1/9/2018 1,955,909.69
1/10/2018 1,619,242.52
1/11/2018 2,267,017.06
1/12/2018 2,198,519.36
1/13/2018 584,448.06
1/15/2018 1,123,662.63
1/16/2018 2,010,443.35
1/17/2018 958,514.85
1/18/2018 2,190,741.31
1/19/2018 811,623.08
1/20/2018 2,016,031.26
1/21/2018 146,946.29
1/22/2018 1,946,640.57
As there isn't a reproducible example provided in the question, here's one that may help you visualize your data better.
Using the built-in economics dataset and the ggplot2 library, you can easily plot a time series.
library(ggplot2)
theme_set(theme_minimal())
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
geom_line(color = "#00AFBB", size = 2)
For your question, you just need to pass in x = Date and y = SalesAmount to obtain the corresponding plot. For the prediction part, you can check out this question: Time series prediction using R
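For example, a minimal sketch, assuming your data has already been read into a data frame called sales with the Date and SalesAmount columns from your table:
library(ggplot2)
sales$Date <- as.Date(sales$Date, format = "%m/%d/%Y")  # parse the dates
ggplot(data = sales, aes(x = Date, y = SalesAmount)) +
  geom_line(color = "#00AFBB")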
The first thing you need before any kind of forecasting is to detect whether you have any kind of seasonality. I recommend adding more data, as it is hard to determine whether you have a repeated pattern with so few observations. In any case, you can try to detect the seasonality as follows:
library(readr)
library(TSA)         # periodogram()
library(data.table)
library(dplyr)       # arrange()

test <- read_table2("C:/Users/Z003WNWH/Desktop/test.txt",
                    col_types = cols(Date = col_date(format = "%m/%d/%Y"),
                                     SalesAmount = col_number()))

p <- periodogram(test$SalesAmount)
topF <- data.table(freq = p$freq, spec = p$spec) %>% arrange(desc(spec))
1 / topF$freq   # convert the dominant frequencies into periods (in days)
Once you have added more data, you can also try ggseasonplot (from the forecast package) to visualize the different seasons.
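As a rough sketch of that idea (treating roughly 30 days as one cycle; ggseasonplot comes from the forecast package):
library(forecast)
sales_ts <- ts(test$SalesAmount, frequency = 30)  # ~monthly cycle
ggseasonplot(sales_ts)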
I have two summers' worth of hourly precipitation and discharge data collected from Durango data stations from 2012-2013. For this research, I am analyzing how each precipitation event impacts river discharge on an hourly basis. The discharge data has readings every 15 minutes, every hour, every day, no matter what the weather. The precipitation data only has times for hours that had rain. Here are two graphs I made of the first few precipitation events I have:
# after loading in my .CSVs 'animas' and 'durango':
library(ggplot2)
disc1 <- animas[c(8700:9000), c(3, 5)]
prec1 <- durango[c(3:11), c(6:7)]
ggplot(data = disc1, aes(x = datetime, y = discharge)) + geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data = prec1, aes(x = DATE, y = HPCP)) + geom_point() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Discharge: all hours get plotted
Precipitation: missing hours as zeros
The way the precipitation is plotting with missing hours is unacceptable for my objective. I need to somehow generate these missing hours and fill the empty precipitation ("HPCP") values with zeros, so I can plot it on the same time scale as the discharge.
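For reference, something like the following is the kind of fill I have in mind (assuming my full precipitation data frame is durango, with DATE already parsed as hourly POSIXct and HPCP numeric):
all_hours <- data.frame(DATE = seq(min(durango$DATE), max(durango$DATE), by = "hour"))
prec_full <- merge(all_hours, durango[, c("DATE", "HPCP")], all.x = TRUE)
prec_full$HPCP[is.na(prec_full$HPCP)] <- 0  # hours with no precipitation record become zero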
Also, is there a way to separate this data into individual precipitation events, excluding events that total less than 0.05 inches (as opposed to setting the time bounds for hundreds of precipitation events by hand)? I need to generate the sets of hours during which a precipitation event occurred and add the discharge values for those hours. I will eventually plot these, as well as take the difference in time between peak rainfall and peak discharge. What data structure should I use, and how?
This seems difficult because zeros between precipitation hours are not present in all cases; for example, two rain events from different dates can end up in adjacent rows, one after the other. How can I sort this out quickly? And can a tail be added to include points from 6 hours before and after each event's start/end time?
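One idea I am considering for the event splitting, sketched on the filled hourly series from above (prec_full): label consecutive wet hours with a shared event id, then drop small events.
wet <- prec_full$HPCP > 0
prec_full$event <- cumsum(wet & !c(FALSE, head(wet, -1))) * wet  # 0 marks dry hours
totals <- tapply(prec_full$HPCP, prec_full$event, sum)           # total precip per event
keep <- as.numeric(names(totals)[totals >= 0.05 & names(totals) != "0"])
prec_events <- prec_full[prec_full$event %in% keep, ]            # events >= 0.05 inches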
I have messed around with the .csv to obtain two possible date/time configurations (HPCP in this file is precipitation). Which is better for convenience and plotting with ggplot2?
All the hours with 0s in HPCP are measurement hours flagged 'F', which means a trace amount of precipitation was detected. These are too insignificant for my analysis.
Thank you in advance.
I have RStudio and want to import a time series data set. The column on the x-axis should be the year; however, when I use the ts.plot command it just plots "Time" on the x-axis. How can I make the years from the data set appear on my plot?
The data set is for Water Usage in NYC from 1898 to 1968. There are two columns, The Year and Water Usage.
This is the link to the data I used (I have downloaded the .TSV file):
https://datamarket.com/data/set/22tl/annual-water-use-in-new-york-city-litres-per-capita-per-day-1898-1968#!ds=22tl&display=line
These are the commands for importing my data:
nyc <- read.csv("~/Desktop/annual-water-use-in-new-york-cit.tsv", sep="")
View(nyc)
ts.plot(nyc)
This is what I get:
There are several ways to do this. I used the CSV file from your link in this demonstration.
library(tidyverse)
nyc <- read_csv("annual-water-use-in-new-york-cit.csv")
head(nyc)
# A tibble: 6 x 2
Year `Annual water use in New York city, litres per capita per day, 1898-1968`
<chr> <chr>
1 1898 402.8
2 1899 421.3
3 1900 431.2
4 1901 426.2
5 1902 425.5
6 1903 423.6
Method 1
Create a time series object and plot this time series.
Firstly, let us fix the column name of the annual water use so that it is easier to call in our code.
nyc <- nyc %>%
rename(
water_use = `Annual water use in New York city, litres per capita per day, 1898-1968`
)
Make the time series object nyc.ts with the ts() function.
nyc.ts <- ts(as.numeric(nyc$water_use), start = 1898)
You can then use the generic plot function to plot the time series.
plot(nyc.ts, xlab = "Years")
Method 2
Use the forecast::autoplot function. Note that this function is built on top of ggplot2.
library(forecast)
autoplot(nyc.ts) + xlab("Years") + ylab("Amount in Litres")
Method 3
With just ggplot2:
nyc$Year <- as.POSIXct(nyc$Year, format = "%Y")
nyc$water_use <- as.numeric(nyc$water_use)
ggplot(nyc, aes(x = Year, y = water_use)) +
  geom_line() + xlab("Years") + ylab("Amount in Litres")
I have a daily time series of the number of visitors to a web site. My series runs from 01/06/2014 until today, 14/10/2015, and I wish to predict the number of visitors in the future. How can I read my series into R? I'm thinking:
series <- ts(visitors, frequency=365, start=c(2014, 6))
If yes, then after running my time series model arimadata = auto.arima(), I want to predict the number of visitors for the next 60 days. How can I do this?
h = ..?
forecast(arimadata, h = ..)
What should the value of h be?
Thanks in advance for your help.
The ts specification is wrong; if you are setting this up as daily observations, then you need to specify what day of the year 2014 is June 1st and specify this in start:
## Create a daily Date object - helps my work on dates
inds <- seq(as.Date("2014-06-01"), as.Date("2015-10-14"), by = "day")
## Create a time series object
set.seed(25)
myts <- ts(rnorm(length(inds)), # random data
start = c(2014, as.numeric(format(inds[1], "%j"))),
frequency = 365)
Note that I specify start as c(2014, as.numeric(format(inds[1], "%j"))). All the complicated bit is doing is working out what day of the year June 1st is:
> as.numeric(format(inds[1], "%j"))
[1] 152
Once you have this, you're effectively there:
## use auto.arima (from the forecast package) to choose ARIMA terms
library(forecast)
fit <- auto.arima(myts)
## forecast for next 60 time points
fore <- forecast(fit, h = 60)
## plot it
plot(fore)
That seems suitable given the random data I supplied...
You'll need to select appropriate arguments for auto.arima() as suits your data.
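For example (these particular arguments are only an illustration, not a recommendation for your data):
## allow seasonal models and do a fuller, slower search over candidate models
fit <- auto.arima(myts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)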
Note that the x-axis labels refer to 0.5 (half) of a year.
Doing this via zoo
This might be easier to do via a zoo object created using the zoo package:
## create the zoo object with the same random data as before
library(zoo)
set.seed(25)
myzoo <- zoo(rnorm(length(inds)), inds)
Note you now don't need to specify any start or frequency info; just use inds computed earlier from the daily Date object.
Proceed as before
## use auto.arima to choose ARIMA terms, this time on the zoo object
fit <- auto.arima(myzoo)
## forecast for next 60 time points
fore <- forecast(fit, h = 60)
The plot, though, will cause an issue, as the x-axis is in days since the epoch (1970-01-01), so we need to suppress the automatic plotting of this axis and then draw our own. This is easy since we have inds:
## plot it
plot(fore, xaxt = "n") # no x-axis
Axis(inds, side = 1)
This only produces a couple of labeled ticks; if you want more control, tell R where you want the ticks and labels:
## plot it
plot(fore, xaxt = "n") # no x-axis
Axis(inds, side = 1,
at = seq(inds[1], tail(inds, 1) + 60, by = "3 months"),
format = "%b %Y")
Here we plot every 3 months.
The ts object does not work well for creating daily time series. I suggest you use the zoo library.
library(zoo)
zoo(visitors, seq(from = as.Date("2014-06-01"), to = as.Date("2015-10-14"), by = 1))
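If you assign that zoo object to a name, say visitors_zoo (a hypothetical name), you can then fit and forecast much as in the earlier answer; a sketch, assuming the forecast package:
library(forecast)
visitors_zoo <- zoo(visitors, seq(from = as.Date("2014-06-01"), to = as.Date("2015-10-14"), by = 1))
fit <- auto.arima(visitors_zoo)  # choose ARIMA terms
forecast(fit, h = 60)            # forecast the next 60 days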
Here's how I created a time series when I was given daily observations with quite a few of them missing. @Gavin Simpson's answer was a big help. Hopefully this saves someone some grief.
The original data looked something like this:
library(lubridate)
set.seed(42)
minday = as.Date("2001-01-01")
maxday = as.Date("2005-12-31")
dates <- seq(minday, maxday, "days")
dates <- dates[sample(1:length(dates),length(dates)/4)] # create some holes
df <- data.frame(date=sort(dates), val=sin(seq(from=0, to=2*pi, length=length(dates))))
To create a time-series with this data I created a 'dummy' dataframe with one row per date and merged that with the existing dataframe:
df <- merge(df, data.frame(date=seq(minday, maxday, "days")), all=T)
This data frame can be cast into a time series; missing dates are NA.
nts <- ts(df$val, frequency=365, start=c(year(minday), as.numeric(format(minday, "%j"))))
plot(nts)
series <- ts(visitors, frequency=365, start=c(2014, 152))
The number 152 is the day of the year for 01-06-2014; because frequency = 365, the start is specified as c(year, day of year).
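You can verify that number directly:
as.numeric(format(as.Date("2014-06-01"), "%j"))
## [1] 152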
To forecast for 60 days, h=60.
forecast(arimadata , h=60)
I'm trying to calculate week-over-week growth rates entirely in R. I could use Excel, or preprocess with Ruby, but that's not the point.
data.frame example
date gpv type
1 2013-04-01 12900 back office
2 2013-04-02 16232 back office
3 2013-04-03 10035 back office
I want to do this grouped by 'type', and I need to roll the Date column up into weeks and then calculate the week-over-week growth.
I think I need ddply to group by week, with a custom function that determines whether a date falls in a given week?
Then, after that, use diff and compute the growth between weeks divided by the previous week's value.
Then I'll plot the week-over-week growth, or export it with a data.frame.
This was closed but had some useful ideas.
UPDATE: answer with ggplot:
Everything is the same as below; just use this instead of plot:
ggplot(data.frame(week = seq(length(gr)), gr), aes(x = week, y = gr * 100)) +
  geom_point() + geom_smooth(method = 'loess') +
  coord_cartesian(xlim = c(.95, 10.05)) + scale_x_discrete() +
  ggtitle('week over week growth rate, from Apr 1') + ylab('growth rate %')
(old, correct answer but using only plot)
Well, I think this is it:
library(plyr)
df_net <- ddply(df_all, .(date), summarise, gpv = sum(gpv))  # df_all has my daily data
df_net$week_num <- strftime(df_net$date, "%U")  # get the week number to 'group by' in ddply
df_weekly <- ddply(df_net, .(week_num), summarise, gpv = sum(gpv))
# week-over-week change divided by the previous week's value; approach via:
# http://stackoverflow.com/questions/15356121/how-to-identify-the-virality-growth-rate-in-time-series-data-using-r
gr <- diff(df_weekly$gpv) / df_weekly$gpv[-length(df_weekly$gpv)]
plot(gr, type = 'l', xlab = 'week #', ylab = 'growth rate percent', main = 'Week/Week Growth Rate')
Any better solutions out there?
For the last part, if you want to calculate the growth rate you can take logs and then use diff with the default parameters lag = 1 (previous week) and differences = 1 (first difference):
df_weekly_log <- log(df_weekly$gpv)
gr <- diff(df_weekly_log, lag = 1, differences = 1)
The latter is an approximation, valid for small differences.
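As a quick numeric check of that approximation:
diff(log(c(100, 105)))   # log difference: 0.0488..., close to the exact 5% growth
diff(c(100, 105)) / 100  # exact growth: 0.05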
Hope it helps.
Is it possible to have a datetime scale that does not treat weekends as part of the time continuum? For instance, if I am plotting stock prices over a 2-week period with a line geometry, I do not want to plot a 2-day period of flatness during the weekend; I would like Friday to connect with Monday.
I imagine that there's a better way, but you could always just use an index for the plot and then assign the dates as labels afterwards:
library(ggplot2)
p <- qplot(1:3, 1:3, geom = 'line')
p + scale_x_continuous("", breaks = 1:3,
                       labels = as.Date(c("2010-06-03", "2010-06-04", "2010-06-07")))