R: week-over-week growth rate calculation on daily time series data

I'm trying to calculate week-over-week growth rates entirely in R. I could use Excel, or preprocess with Ruby, but that's not the point.
data.frame example
date gpv type
1 2013-04-01 12900 back office
2 2013-04-02 16232 back office
3 2013-04-03 10035 back office
I want to do this factored by 'type', and I need to roll the Date column up into weeks, then calculate the week-over-week growth.
I think I need ddply to group by week, with a custom function that determines which week a given date falls in.
Then use diff and find the growth between weeks divided by the previous week's value.
Finally I'll plot the week-over-week growth, or export it as a data.frame.
This question was closed but had some useful ideas.

UPDATE: answer with ggplot:
All the same as below, just use this instead of plot
ggplot(data.frame(week=seq(length(gr)), gr), aes(x=week,y=gr*100)) + geom_point() + geom_smooth(method='loess') + coord_cartesian(xlim = c(.95, 10.05)) + scale_x_discrete() + ggtitle('week over week growth rate, from Apr 1') + ylab('growth rate %')
(old, correct answer but using only plot)
Well, I think this is it:
library(plyr)
df_net <- ddply(df_all, .(date), summarise, gpv=sum(gpv)) # df_all has my daily data.
df_net$week_num <- strftime(df_net$date, "%U") # get the week number to group by in ddply
df_weekly <- ddply(df_net, .(week_num), summarise, gpv=sum(gpv))
gr <- diff(df_weekly$gpv)/df_weekly$gpv[-length(df_weekly$gpv)] # each week's change divided by the previous week's total, via: http://stackoverflow.com/questions/15356121/how-to-identify-the-virality-growth-rate-in-time-series-data-using-r
plot(gr, type='l', xlab='week #', ylab='growth rate percent', main='Week/Week Growth Rate')
Any better solutions out there?

For the last part, if you want to calculate the growth rate you can take logs and then use diff, with the default parameters lag = 1 (previous week) and differences = 1 (first difference):
df_weekly_log <- log(df_weekly$gpv)
gr <- diff(df_weekly_log, lag = 1, differences = 1)
The latter is an approximation to the growth rate, valid for small changes.
Hope it helps.
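As a quick sketch of why the two agree for small changes (the weekly totals here are made up for illustration):

```r
# Made-up weekly totals to compare exact vs. log-difference growth rates
gpv <- c(100, 105, 112, 110)
exact  <- diff(gpv) / head(gpv, -1)  # (x_t - x_{t-1}) / x_{t-1}
approx <- diff(log(gpv))             # log(x_t / x_{t-1})
round(exact, 4)   # 0.0500  0.0667 -0.0179
round(approx, 4)  # 0.0488  0.0645 -0.0180
```

The two series diverge as growth rates get larger, since log(1 + x) ≈ x only holds for small x.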

Related

How to create a daily time series with monthly cycling patterns

I have a series of data for daily sales amounts from 1/1/2018 to 10/15/2018; an example is shown below. There are some monthly cycling patterns in the sales amount: there is always a sales peak at the end of each month, and slight fluctuations in the amount in the middle of the month. Also, sales in June, July and August are generally higher than in other months. Now I need to predict the sales amount for the 10 days after 10/15/2018. I'm new to time series and ARIMA. Here I have two questions:
1. How to create such a daily time series and plot it with the date?
2. How can I set the cycle(or frequency) to show the monthly cycling pattern?
Date SalesAmount
1/1/2018 31,380.31
1/2/2018 384,418.10
1/3/2018 1,268,633.28
1/4/2018 1,197,742.76
1/5/2018 417,143.36
1/6/2018 693,172.65
1/8/2018 840,384.76
1/9/2018 1,955,909.69
1/10/2018 1,619,242.52
1/11/2018 2,267,017.06
1/12/2018 2,198,519.36
1/13/2018 584,448.06
1/15/2018 1,123,662.63
1/16/2018 2,010,443.35
1/17/2018 958,514.85
1/18/2018 2,190,741.31
1/19/2018 811,623.08
1/20/2018 2,016,031.26
1/21/2018 146,946.29
1/22/2018 1,946,640.57
As there isn't a reproducible example provided in the question, here's one that may help you visualize your data better.
Using the built-in economics dataset and the ggplot2 library, you can easily plot a time series.
library(ggplot2)
theme_set(theme_minimal())
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
geom_line(color = "#00AFBB", size = 2)
For your question, you just need to pass in x=Date and y=SalesAmount to obtain the plot below. To your 2nd question on predicting sales amount with timeseries, you can check out this question over here: Time series prediction using R
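For instance, a minimal sketch with the first few rows of the posted sales table typed in by hand (purely for illustration; the full table would normally be read from a file):

```r
library(ggplot2)
# First rows of the posted sales data, entered manually for illustration
sales <- data.frame(
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04")),
  SalesAmount = c(31380.31, 384418.10, 1268633.28, 1197742.76)
)
p <- ggplot(sales, aes(x = Date, y = SalesAmount)) +
  geom_line(color = "#00AFBB")
p
```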
The first thing you need before any kind of forecasting is to detect whether you have any kind of seasonality. I recommend adding more data, as it is hard to determine whether you have a repeated pattern from so few observations. Anyway, you can try to determine the seasonality as follows:
library(readr)
library(TSA)        # for periodogram()
library(data.table)
library(dplyr)
test <- read_table2("C:/Users/Z003WNWH/Desktop/test.txt",
col_types = cols(Date = col_date(format = "%m/%d/%Y"),
SalesAmount = col_number()))
p <- periodogram(test$SalesAmount)
topF <- data.table(freq=p$freq, spec=p$spec) %>% arrange(desc(spec))
1/topF$freq # convert the dominant frequencies to periods (in days)
When you will add more data you can try to use ggseasonplot to visualize the different seasons.

How to create boxplot based on 5 year intervals in R

I have a continuous variable y measured on different dates. I need to make boxplots with a box showing the distribution of y for each 5 year interval.
Sample data:
rdob <- as.Date(dob, format= "%m/%d/%y") # dob is a vector of date strings
ggplot(data = data, aes(x=rdob, y=ageyear)) + geom_boxplot()
#Warning message:
#Continuous x aesthetic -- did you forget aes(group=...)?
This image is the first one I tried. What I want is a box for every five year interval, instead of a box for every year.
Here is a way to pull out the year in base R:
format(as.Date("2008-11-03", format="%Y-%m-%d"), "%Y")
Simply wrap your date vector in a format() and add the "%Y". To get this to be integer, you can use as.integer.
You could also take a look at the year function in the lubridate package which will make this extraction a little bit more straightforward.
One method to get 5-year intervals is to use cut to create a factor variable with levels at selected break points. Unless you have dozens of years, your best bet is to set the break points manually:
df$myTimeInterval <- cut(df$years, breaks=c(1995, 2000, 2005, 2010, 2015))
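A quick sketch of what cut produces with those break points (the years here are made up):

```r
# cut() labels each year with the 5-year interval it falls into
# (intervals are open on the left, closed on the right by default)
years <- c(1996, 1999, 2003, 2008, 2014)
cut(years, breaks = c(1995, 2000, 2005, 2010, 2015))
# [1] (1995,2000] (1995,2000] (2000,2005] (2005,2010] (2010,2015]
# Levels: (1995,2000] (2000,2005] (2005,2010] (2010,2015]
```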
Here's an example taking Dave2e's suggestion of using cut on date intervals along with ggplot's group aesthetic mapping:
library(ggplot2)
n <- 1000
## Randomly sample birth dates and dummy up an effect that trends upward with DOB
dobs <- sample(seq(as.Date('1970/01/01'), Sys.Date(), by="day"), n)
effect <- rnorm(n) + as.numeric(as.POSIXct(dobs)) / as.numeric(as.POSIXct(Sys.Date()))
data <- data.frame(dob=dobs, effect=effect)
## boxplot w/ DOB binned to 5 year intervals
ggplot(data=data, aes(x=dob, y=effect)) + geom_boxplot(aes(group=cut(dob, "5 year")))
Or, using lubridate:
library(lubridate)
year <- year(rdob)

starting a daily time series in R

I have a daily time series of the number of visitors on a web site. My series runs from 01/06/2014 until today, 14/10/2015, and I wish to predict the number of visitors in the future. How can I read my series into R? I'm thinking:
series <- ts(visitors, frequency=365, start=c(2014, 6))
If yes: after fitting my time series model with arimadata <- auto.arima(), I want to predict the number of visitors for the next 60 days. How can I do this?
forecast(arimadata, h = ?)
What should the value of h be?
Thanks in advance for your help.
The ts specification is wrong; if you are setting this up as daily observations, then you need to work out which day of the year June 1st 2014 is and give that in start:
## Create a daily Date object - helps my work on dates
inds <- seq(as.Date("2014-06-01"), as.Date("2015-10-14"), by = "day")
## Create a time series object
set.seed(25)
myts <- ts(rnorm(length(inds)), # random data
start = c(2014, as.numeric(format(inds[1], "%j"))),
frequency = 365)
Note that I specify start as c(2014, as.numeric(format(inds[1], "%j"))). All the complicated bit is doing is working out what day of the year June 1st is:
> as.numeric(format(inds[1], "%j"))
[1] 152
Once you have this, you're effectively there:
library(forecast)
## use auto.arima to choose ARIMA terms
fit <- auto.arima(myts)
## forecast for next 60 time points
fore <- forecast(fit, h = 60)
## plot it
plot(fore)
That seems suitable given the random data I supplied...
You'll need to select appropriate arguments for auto.arima() as suits your data.
Note that the x-axis labels refer to 0.5 (half) of a year.
Doing this via zoo
This might be easier to do via a zoo object created using the zoo package:
## create the zoo object as before
set.seed(25)
myzoo <- zoo(rnorm(length(inds)), inds)
Note you now don't need to specify any start or frequency info; just use inds computed earlier from the daily Date object.
Proceed as before
## use auto.arima to choose ARIMA terms
fit <- auto.arima(myzoo)
## forecast for next 60 time points
fore <- forecast(fit, h = 60)
The plot, though, will cause an issue, as the x-axis is in days since the epoch (1970-01-01), so we need to suppress the automatic plotting of this axis and then draw our own. This is easy, as we have inds:
## plot it
plot(fore, xaxt = "n") # no x-axis
Axis(inds, side = 1)
This only produces a couple of labeled ticks; if you want more control, tell R where you want the ticks and labels:
## plot it
plot(fore, xaxt = "n") # no x-axis
Axis(inds, side = 1,
at = seq(inds[1], tail(inds, 1) + 60, by = "3 months"),
format = "%b %Y")
Here we place a labelled tick every 3 months.
The ts object does not work well for daily time series. I suggest you use the zoo library.
library(zoo)
zoo(visitors, seq(from = as.Date("2014-06-01"), to = as.Date("2015-10-14"), by = 1))
Here's how I created a time series when I was given daily observations with quite a few missing. @gavin-simpson was a big help. Hopefully this saves someone some grief.
The original data looked something like this:
library(lubridate)
set.seed(42)
minday = as.Date("2001-01-01")
maxday = as.Date("2005-12-31")
dates <- seq(minday, maxday, "days")
dates <- dates[sample(1:length(dates),length(dates)/4)] # create some holes
df <- data.frame(date=sort(dates), val=sin(seq(from=0, to=2*pi, length=length(dates))))
To create a time-series with this data I created a 'dummy' dataframe with one row per date and merged that with the existing dataframe:
df <- merge(df, data.frame(date=seq(minday, maxday, "days")), all=T)
This dataframe can be cast into a timeseries. Missing dates are NA.
nts <- ts(df$val, frequency=365, start=c(year(minday), as.numeric(format(minday, "%j"))))
plot(nts)
series <- ts(visitors, frequency=365, start=c(2014, 152))
Here 152 is the day of the year corresponding to 01-06-2014; with frequency=365, start is given as c(year, day of year).
To forecast for 60 days, h=60.
forecast(arimadata , h=60)
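The 152 can be checked directly with the %j (day-of-year) format code:

```r
# 2014 is not a leap year, so 2014-06-01 is day 31+28+31+30+31+1 = 152
as.numeric(format(as.Date("2014-06-01"), "%j"))
# [1] 152
```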

R - ggplot2 - How to use limits on POSIX axis?

What is the smartest way to manipulate POSIX for use in ggplot axis?
I am trying to create a function for plotting many graphs (One per day) spanning a period of weeks, using POSIX time for the x axis.
To do so, I create an additional integer column DF$Day with the day, that I input into the function. Then, I create a subset using that day, which I plot using ggplot2. I figured how to use scale_x_datetime to format the POSIX x axis. Basically, I have it show the hours & minutes only, omitting the date.
Here is my question: How can I set the limits for each individual graph in hours of the day?
Below is some working, reproducible code to give an idea. It creates the first day's plot, shows it for 3 seconds, then proceeds to create the second day's. But each day's limits are chosen based on the range of the time variable. How can I make the range span, for instance, the whole day (0h - 24h)?
library(ggplot2)
library(scales) # for date_format()
DF <- data.frame(matrix(ncol = 0, nrow = 4))
DF$time <- as.POSIXct(c("2010-01-01 02:01:00", "2010-01-01 18:10:00", "2010-01-02 04:20:00", "2010-01-02 13:30:00"))
DF$observation <- c(1,2,1,2)
DF$Day <- c(1,1,2,2)
for (Individual_Day in 1:2) {
Day_subset <- DF[DF$Day == as.integer(Individual_Day),]
print(ggplot( data=Day_subset, aes_string( x="time", y="observation") ) + geom_point() +
scale_x_datetime( breaks=("2 hour"), minor_breaks=("1 hour"), labels=date_format("%H:%M")))
Sys.sleep(3) }
Well, here's one way.
# ...
for (Individual_Day in 1:2) {
Day_subset <- DF[DF$Day == as.integer(Individual_Day),]
lower <- with(Day_subset,as.POSIXct(strftime(min(time),"%Y-%m-%d")))
upper <- with(Day_subset,as.POSIXct(strftime(as.Date(max(time))+1,"%Y-%m-%d"))-1)
limits = c(lower,upper)
print(ggplot( data=Day_subset, aes( x=time, y=observation) ) +
geom_point() +
scale_x_datetime( breaks=("2 hour"),
minor_breaks=("1 hour"),
labels=date_format("%H:%M"),
limits=limits)
)
}
The calculation for lower takes the minimum time in the subset and coerces it to character with only the date part (i.e., strips away the time part). Converting back to POSIXct gives the beginning of that day.
The calculation for upper is a little more complicated. You have to convert the maximum time to a Date value and add 1 (i.e., 1 day), then convert to character (stripping off the time part), convert back to POSIXct, and subtract 1 (i.e., 1 second). This generates 23:59:59 on the end day.
Huge amount of work for such a small thing. I hope someone else posts a simpler way to do this...
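A condensed sketch of just the lower/upper calculation, with made-up times and the timezone pinned to UTC so the result is reproducible:

```r
# Strip the time-of-day to get midnight, then build 23:59:59 of the same day
times <- as.POSIXct(c("2010-01-01 02:01:00", "2010-01-01 18:10:00"), tz = "UTC")
lower <- as.POSIXct(strftime(min(times), "%Y-%m-%d", tz = "UTC"), tz = "UTC")
upper <- as.POSIXct(strftime(as.Date(max(times)) + 1, "%Y-%m-%d"), tz = "UTC") - 1
format(c(lower, upper), "%Y-%m-%d %H:%M:%S", tz = "UTC")
# [1] "2010-01-01 00:00:00" "2010-01-01 23:59:59"
```

Without the explicit tz arguments the answer above still works, but the computed day can shift if the session timezone differs from the data's.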

R and ggplot2: Make line graph of sum of value for three categorical variables over time

I'm trying to figure out how to do something with ggplot2 and R that seems like it should be really simple. It's so simple... that I cannot for the life of me figure out how to do it. I'm sure the answer is staring me in the face in the ggplot documentation, but I can't... find it. So. I'm here.
I frequently have datasets a lot like this:
tdf <- data.frame('datetime' = seq(from=as.POSIXct('2012-01-01 00:00:00'),
to=as.POSIXct('2012-01-31 23:59:59'), by=1))
tdf$variable <- rep(c('a','b','c'), length.out=length(tdf$datetime))
tdf$value <- sample(1:10, length(tdf$datetime), replace=T)
> head(tdf)
datetime variable value
1 2012-01-01 00:00:00 a 7
2 2012-01-01 00:00:01 b 3
3 2012-01-01 00:00:02 c 7
4 2012-01-01 00:00:03 a 8
5 2012-01-01 00:00:04 b 2
6 2012-01-01 00:00:05 c 3
That is: I have a categorical variable (a factor), a value for that variable, and a timestamp at which the observation was recorded. I want to plot the sum of the value, for each categorical variable, in a given time "bucket" -- preferably using ggplot2. I would like to do it without having to pre-aggregate the data before I visualize it -- that is, I really want the flexibility of leaving the dataset as it is and passing arguments to ggplot2 to aggregate it at plot time. And yet, I'm completely flummoxed. The documentation on geom_line says to use stat='identity' to get the sum of the value, but once I've done that I can no longer define any kind of bin. If I use stat_summary, I frequently don't get a plot back at all. The closest I've gotten is:
tdf$variable <- factor(tdf$variable)
vis <- ggplot(tdf, aes(x=datetime, y=value, color=variable))
vis <- vis + geom_line(stat='identity')
vis <- vis + scale_x_datetime()
...which at least prints a plot, with a line corresponding to the values of each factor... by second. I cannot get it to bin the sum(value) operation by, say, an hour or a day or a week without doing a bunch of work to pre-aggregate the data.
Help?
Edit: Apologies to anyone whose R session choked on this test data. I've cut it back.
Alright, I think this is what you want. I've cut down your dataset dramatically; the posted one is waaaay too big for testing this stuff out.
tdf <- data.frame('datetime' = seq(from=as.POSIXct('2012-01-01 00:00:00'), to=as.POSIXct('2012-01-01 00:10:59'), by=1))
tdf$variable <- rep(c('a','b','c'), length.out=length(tdf$datetime))
tdf$value <- sample(1:10, length(tdf$datetime), replace=T)
tdf$variable <- factor(tdf$variable)
vis2 <- ggplot(tdf, aes(datetime, color=variable)) +
geom_bar(binwidth=5,aes(weight=value),position="dodge") +
scale_x_datetime(limits=c(min(tdf$datetime), max(tdf$datetime)))
geom_bar uses stat_bin, so you can change your bins. By default it gets the counts, but if you want the sum, you can add the weight argument in aes(). Let me know if this doesn't answer your question.
BTW, with the way this specific data is set up, it would probably make sense to separate your variables using something like facets, i.e.:
vis2 <- ggplot(tdf, aes(datetime, fill=variable)) +
geom_bar(binwidth=100,aes(weight=value),position="dodge") +
scale_x_datetime(limits=c(min(tdf$datetime), max(tdf$datetime))) +
facet_wrap(~variable)
Otherwise it might look like the variables are across different time bins.
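A caveat for readers on current ggplot2: geom_bar() lost its binwidth argument in ggplot2 2.0, so the code above needs adjusting on recent versions. One possible substitute (a sketch, not the answerer's original method) is stat_summary_bin, which sums y into x bins directly; on a datetime axis the binwidth is in seconds, so 60 below means one-minute bins:

```r
library(ggplot2)
# Same shape of dummy data as in the answer above
tdf <- data.frame(datetime = seq(as.POSIXct('2012-01-01 00:00:00', tz = "UTC"),
                                 as.POSIXct('2012-01-01 00:10:59', tz = "UTC"),
                                 by = 1))
tdf$variable <- factor(rep(c('a', 'b', 'c'), length.out = nrow(tdf)))
tdf$value <- sample(1:10, nrow(tdf), replace = TRUE)
# Sum `value` into 60-second bins, one panel per variable
p <- ggplot(tdf, aes(datetime, value, fill = variable)) +
  stat_summary_bin(fun = sum, geom = "bar", binwidth = 60) +
  facet_wrap(~variable)
p
```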
