How to properly use time-series objects in R?

I have some daily data which looks like this:
> head(daily_leads)
date gross.leads day_of_week month
1 2007-01-01 6427 2 1
2 2007-01-02 7111 3 1
3 2007-01-03 7367 4 1
4 2007-01-04 7431 5 1
5 2007-01-05 7257 6 1
6 2007-01-06 7231 7 1
There are some clear intra-week patterns: the same day of the week usually shows similar levels (Fridays have counts similar to other Fridays). There are also trends over the course of the year.
What is the correct way to set up a time-series object in R in order to make forecasts in my case? When I try this:
> leads <- ts(daily_leads$gross.leads, frequency = 1/7)
> q <- decompose(leads)
Error in decompose(leads) : time series has no or less than 2 periods
I'm not sure whether I'm setting it up correctly, but I want the time-series object to reflect the seasonality within both the week and the year. Any help would be much appreciated. I have 2543 observations.

You need frequency = 7. In a ts object, the frequency is the number of observations per seasonal period (here, 7 per weekly cycle), not its reciprocal.
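For reference, a minimal sketch of the corrected call. Since the question also mentions yearly seasonality, the msts()/tbats() route from the forecast package is shown as one common option; it is an addition here, not part of the original answer:
leads <- ts(daily_leads$gross.leads, frequency = 7)   # 7 observations per weekly cycle
q <- decompose(leads)
plot(q)
# to model the weekly and yearly patterns together, the forecast
# package offers multi-seasonal time-series objects:
library(forecast)
leads_m <- msts(daily_leads$gross.leads, seasonal.periods = c(7, 365.25))
fit <- tbats(leads_m)
fc <- forecast(fit, h = 28)   # e.g. four weeks ahead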

Related

R subset column of data frame based on list of dates

PeopleId Date NumItemsQuotes
(dbl) (time) (dbl)
1 2 2015-10-26 147
2 2 2015-10-27 4
3 2 2015-10-28 268
4 2 2015-10-30 55
5 2 2015-11-02 1
6 2 2015-11-03 6
Given the above data frame, I want to subset it to just the observations which occur on a weekend. So far, I've created a vector of the weekends because it's a small set (only 5 weekends), but I'm stuck with this approach.
weekends<- as.Date(c("2015-10-24","2015-10-25","2015-10-30","2015-11-01","2015-11-07","2015-11-08","2015-11-14","2015-11-15","2015-11-21","2015-11-22"))
weekenders<-itemquote[,"PeopleId"][itemquote$Date == weekends]
In the interest of genuinely learning this concept: I could also create an index column with timeDate::isWeekend() and then drop the FALSE observations, correct?
Any other suggestions for approaching this?
If you want to stick to your method using the object weekends, you could use %in%. So your code would be weekenders <- itemquote[,"PeopleId"][itemquote$Date %in% weekends], or more concisely: weekenders <- subset(itemquote, Date %in% weekends)$PeopleId.
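A minimal sketch of both routes, assuming the data frame is called itemquote as in the question; the weekdays() variant is a base-R alternative and not part of the original answer:
# keep only rows whose Date falls in the hand-built weekends vector
weekenders <- subset(itemquote, Date %in% weekends)$PeopleId
# or flag weekend days directly; weekdays() returns locale-dependent
# day names, and timeDate::isWeekend() from the question works similarly
is_wknd <- weekdays(as.Date(itemquote$Date)) %in% c("Saturday", "Sunday")
weekenders2 <- itemquote$PeopleId[is_wknd]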

R: How to plot multiple series when the series is included as a variable?

I want to plot multiple lines to one graph of five different time series. The problem is that my data frame is arranged like so:
Series Time Price ...
1 Dec 2003 5
2 Dec 2003 10
3 Dec 2003 2
1 Jan 2004 10
2 Jan 2004 10
3 Jan 2004 5
This is a simplified version, and there are many other variables for each observation. I'd like to be able to plot time vs price and use the first variable as the indicator for which series.
The time period is 77 months long, so I'm not sure if there's an easy way to reshape the data to look like:
Series Dec.2003.Price Jan.2004.Price ...
1 5 10
2 10 10
3 2 5
or a way to graph these like I said without reshaping.
You can try xyplot() from the lattice package:
xyplot(Price ~ Time, groups = Series, data = df, type = "l")
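A self-contained sketch, assuming Time has already been converted to a Date (or numeric) axis; the ggplot2 lines are an equivalent alternative, not part of the original answer:
library(lattice)
xyplot(Price ~ Time, groups = Series, data = df, type = "l", auto.key = TRUE)
# ggplot2 equivalent
library(ggplot2)
ggplot(df, aes(Time, Price, colour = factor(Series))) + geom_line()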

How to convert a time series into something useful?

I have a time series in the following format:
Date SKU Sales
1 2014-01-02 000823307B 5
2 2014-01-03 0008233043 52
3 2014-01-03 000823307B 4
4 2014-01-04 000823307B 10
5 2014-01-05 000823307B 1
6 2014-01-06 0008233043 10
7 2014-01-06 0008233053 43
8 2014-01-06 000823307B 7
9 2014-01-07 0008233043 30
10 2014-01-07 0008233053 5
I would like to find out whether the sales of the different SKUs correlate. How can I do this in R? I'm struggling to find a starting point; I don't quite understand how to convert this into a ts object, or whether that would even be the right approach.
After reading the comments I realised the question might be hard to follow for some people; luckily I also got some very useful hints.
I will solve the problem by creating a pivot table and then grouping the sales data by week. That way I should be able to draw a line chart showing the sales of the different products, and also correlate the different sales patterns.
You can represent your data such that each SKU is a variable and each day/month (choose a time period that can give you meaningful values for correlation) is an observation.
You can then perform a PCA using princomp() or prcomp(). This would be a sufficient starting point.
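A rough sketch of that layout, assuming the data frame is called sales and days with no recorded sales count as 0; the reshape2 step and cor() call are illustrative additions to the answer above:
library(reshape2)
# one row per Date, one column per SKU
wide <- dcast(sales, Date ~ SKU, value.var = "Sales", fill = 0)
# pairwise correlations between the SKU columns
cor(wide[, -1])
# or a PCA, as suggested above
prcomp(wide[, -1], scale. = TRUE)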

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong: t increments by one per row even though the rows are unevenly spaced in calendar time:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
To me, that is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "days"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to a regular (continuous) time series is a good idea.
You can use xts to transform the data (it is handy because it can be used in other packages much like a regular ts).
Filling the gaps
library(xts)   # also loads zoo, which provides na.locf / na.approx / na.spline
# convert myDate to POSIXct if necessary
# create an xts object from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), x$myDate)
ts1
# create an empty (zero-width) xts covering every day in the range
ts_empty <- xts(, seq(from = start(ts1), to = end(ts1), by = "DSTday"))
# merge the empty series into the data and fill the gaps with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or interpolate, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo/xts-ready gap-filling functions are:
# na.locf   - carry the previous value forward
# na.approx - linear interpolation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on your related question it is very likely. I think you want to aggregate the values with the sum function:
ts1 <- period.apply( ts1, endpoints(ts1,'days'), sum)
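Once the series is regular, the time-trend question from above becomes straightforward. A sketch, assuming the columns are still named a (Amount) and c (Count) after the merge:
# every calendar day now has a row, so a 1:n counter really is a daily trend
d_cont <- data.frame(Date = index(ts2), coredata(ts2))
d_cont$t <- seq_len(nrow(d_cont))
fit_t <- lm(a ~ t + c, data = d_cont)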

Flowstrates and R: extracting and reshaping data in required format

I'm trying to transform a large dataset into the formats required for analysis with the flowstrates package.
What I currently have is a large file (600k trips) with origin and destination points.
Format is sort of like this:
tripID Month start_pt end_pt
1 June 1 3
2 June 1 3
3 July 1 5
4 July 1 7
5 July 1 7
What I need to be able to generate is a file that has trip counts by unit time (let's say months) in a format like this:
start_pt end_pt June July August ... December
1 3 2 0 5 9
1 5 0 1 4 4
1 7 0 2 0 0
It's easy enough to pre-segment the data by month and then generate counts for each origin-destination pair, but then putting it all back together causes all sorts of problems since each of the pre-segmented chunks of data have very different sizes. So it seems that I'd need to do this for the entire dataset at once.
Are there any packages for doing this type of processing? Would it be easier to do this in something like SQL or SQLite?
Thanks in advance for any help.
You can use the reshape2 package to do this fairly easily.
If your data is dat,
library("reshape2")
dcast(dat, start_pt+end_pt~Month, value.var="tripID", fun.aggregate=length)
This gives a single entry for each start_pt/end_pt/Month combination, the value of which is how many cases had that combination (the length of tripID for that set).
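A small runnable illustration built from the sample rows above (the August-to-December columns in the desired output would come from the rest of the 600k-trip file):
library(reshape2)
dat <- data.frame(tripID   = 1:5,
                  Month    = factor(c("June", "June", "July", "July", "July"),
                                    levels = month.name),  # calendar column order
                  start_pt = 1,
                  end_pt   = c(3, 3, 5, 7, 7))
dcast(dat, start_pt + end_pt ~ Month, value.var = "tripID", fun.aggregate = length)
# counts: (1,3) has 2 June trips; (1,5) has 1 July trip; (1,7) has 2 July trips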
