I am fairly new to this field and would like some help/advice. Any help would be much appreciated!
I am currently working on a forecasting project with time series data. However, the data does not contain any weekend/holiday observations. My goal is to predict the future value on a specific date. For example, given data from 2000 to the present, I would like to predict the value for 2023-05-01. I tried creating some plots and using the zoo package, but I am unsure how to approach this unevenly spaced data. Can someone suggest what models I should try? Btw, I am using R for this project. Thank you all so much!
I would agree with @jhoward that this is missing data, not unevenly spaced data (like timestamped data), so you can interpolate the missing values. This overview of the possible techniques may help: 4-techniques-to-handle-missing-values-in-time-series-data
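As a minimal sketch of the interpolation idea (assuming your series is a zoo object indexed by Date; the values below are made up):

library(zoo)
# weekday-only toy series (hypothetical values)
z <- zoo(c(10, 12, 11, 13, 14),
         as.Date(c("2023-04-24", "2023-04-25", "2023-04-26",
                   "2023-04-27", "2023-04-28")))
# expand to a full daily calendar, then fill the gaps by linear interpolation
calendar <- zoo(, seq(start(z), as.Date("2023-05-01"), by = "day"))
filled <- na.approx(merge(z, calendar), rule = 2)  # rule = 2 extends the end points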
I am relatively new to predictive modeling and would like some brainstorming help and an assessment of feasibility.
I currently have the following variables in the data set for 2018-present, with one row per order:
date
day of week
item category
order id
lat / long for shipping address.
I would like to predict weekly sales for the remaining weeks of this year BY item category. I am most comfortable using R at the moment.
What algorithm/package would you recommend I look into given that I would like to predict weekly sales volume by category?
The shortest answer is to start with the tidyverse set of packages. group_by() from dplyr is very powerful for computing values by some factor. It sounds like your data is already in tidy form, which works best with the tidyverse framework, as it lets you easily vectorize operations over a data.frame. Check out the main packages they offer and their overviews here. Start with simpler models like lm(), and if the need arises, continue with more advanced ones. Which of the variables are you going to use as predictors?
No matter which model you choose, once you have built an appropriate one, you can use the built-in predict() together with group_by(). More details on basic prediction here.
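A minimal sketch of that pipeline, with made-up data standing in for yours (the names orders, date, and item_category are assumptions based on your description):

library(dplyr)
library(lubridate)

# toy orders table: one row per order (hypothetical values)
set.seed(1)
orders <- data.frame(
  date = sample(seq(as.Date("2018-01-01"), as.Date("2018-06-30"), by = "day"),
                500, replace = TRUE),
  item_category = sample(c("toys", "books", "games"), 500, replace = TRUE))

# weekly sales volume by category
weekly <- orders %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  group_by(item_category, week) %>%
  summarise(sales = n(), .groups = "drop")

# one simple linear trend per category, then predict the remaining weeks
fit <- lm(sales ~ as.numeric(week) * item_category, data = weekly)
future <- expand.grid(
  week = seq(max(weekly$week) + 7, as.Date("2018-12-31"), by = "week"),
  item_category = unique(weekly$item_category))
future$pred <- predict(fit, newdata = future)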
By the way, I can't see the data set you talk about, only the description of it. Could you provide a link to a representative sample? It would allow me to provide deeper insight.
At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remember from university the nice R function that plots every column of a data frame against every other column, as in:
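(If I recall correctly, it is simply plot() applied to the whole data frame; df below is a hypothetical stand-in for the portfolio data:)

plot(df)  # scatterplot matrix: every column plotted against every other column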
For the dependency between issue.age and duration, this plot is actually interesting: you can clearly see that high issue ages come with shorter policy durations (because there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact, you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like
where you could immediately see that the average age of newly issued policies increased from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset I feed it, because then I could do it faster manually in Excel.
So my question is: is there an easy way to plot each column of a matrix against every other column, with more flexible chart types than the standard plot(data.frame)?
The ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)                  # provides ggpairs()
data(tips, package = "reshape")  # example data set from the reshape package
ggpairs(tips)                    # plots every column against every other column
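You can also restrict the columns and map an aesthetic, which gets at the "has the age distribution shifted across years" type of question above; a sketch reusing the tips data:

# colour every panel by a grouping factor
ggpairs(tips, columns = c("total_bill", "tip", "day"),
        mapping = aes(colour = day))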
First time poster here, so please forgive any faux pas on my part.
I have a set of data which consists of essentially 3 fields:
1) Position
2) Start_of_shift (datetime object)
3) End_of_Shift (datetime object)
From the datetime objects I can extract the date, day of week, and time. The schedules are 24/7 and do not conform to any standard three-shift rotation; they are fairly site-specific. (I am using the lubridate package.)
I would like to visualize time of day vs. day of week, showing the number of staff on duty, so that I can see heavy concentrations of staff and where I am light at specific days and times.
I am unsure how to approach this problem, as I am relatively new to R and have found the various date-time packages and base utilities confusing and often conflicting with each other. While I can find plenty of examples of time series plotting, I have found next to nothing on how to plot when you have a start and end time in separate fields and want to show areas of overlap.
I was thinking of using ggplot2 with geom_tile, plus a smoother, but wanted to know if there are any good examples out there that do something similar, or if anyone has ideas on how I should transform my data to best achieve my objective. I wanted to keep the time continuous, but as a last resort I will discretize it into 15-minute chunks; are there other options?
Any thoughts?
You might consider using a Gantt chart; the gantt.chart function in the plotrix package is one option for creating them.
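A minimal sketch with three made-up shifts (assuming gantt.chart() is given a list of labels, starts, ends, and priorities):

library(plotrix)
# hypothetical shifts spanning one day
shift.info <- list(
  labels     = c("Guard 1", "Guard 2", "Guard 3"),
  starts     = as.POSIXct(c("2013-01-07 06:00", "2013-01-07 14:00", "2013-01-07 22:00")),
  ends       = as.POSIXct(c("2013-01-07 14:00", "2013-01-07 22:00", "2013-01-08 06:00")),
  priorities = c(1, 2, 3))
gantt.chart(shift.info, main = "Shift coverage")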
Maybe the timeline package is what you need. I've found it very good for planning projects. It's on CRAN, but you can see a quick example at it's Github home here.
To work out how many people are present (or should be, if it's a future event), you need to think of your staffing as a stock/flow.
The first step would be to use the melt function in the reshape2 package to get all the dates in one column and the event type (starting/finishing) in another.
From this you can create a running total of how many people will be in at any time, as sketched below.
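A minimal sketch of both steps, using made-up shifts and the field names from the question:

library(reshape2)

# toy shift data (hypothetical values)
shifts <- data.frame(
  Position       = c("A", "B", "C"),
  Start_of_shift = as.POSIXct(c("2013-01-07 06:00", "2013-01-07 08:00", "2013-01-07 14:00")),
  End_of_Shift   = as.POSIXct(c("2013-01-07 14:00", "2013-01-07 16:00", "2013-01-07 22:00")))

# step 1: melt the two datetime columns into one 'time' column plus an event label
long <- melt(shifts, id.vars = "Position",
             variable.name = "event", value.name = "time")

# step 2: +1 at each start, -1 at each end, then a running total after sorting by time
long$change <- ifelse(long$event == "Start_of_shift", 1, -1)
long <- long[order(long$time), ]
long$staff <- cumsum(long$change)  # headcount in effect from each event onward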
I have the following time series data, with the 60 data points shown below (please also see a simple plot of the data, made in R, below). I think that if I draw a moving-average curve through the points, we can better understand the patterns in the data, but I don't know how to do this in R. Could someone help me with that? Also, I am not sure whether this is a good way to identify patterns; please suggest a better way if there is one. Thank you.
x <- c(18,21,18,14,8,14,10,14,14,12,12,14,10,10,12,6,10,8,
14,10,10,6,6,4,6,2,8,6,2,6,4,4,2,8,6,6,8,12,8,8,6,6,2,2,4,
4,4,8,14,8,6,6,2,6,6,4,4,8,6,6)
To answer your question about moving averages: you can accomplish this with the help of rollmean(), which is in the zoo package.
From Joshua's comment: you could also look into the TTR package, which depends on xts, which in turn depends on zoo. There are also other moving averages in TTR: check ?MA.
library(zoo)  # rollmean() comes from zoo (loading TTR attaches it too, via xts)
# assuming your vector is loaded in dat
# sliding window / moving average of size 5
dat.k5 <- rollmean(dat, k = 5)
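To see what the smoothing does, you can overlay it on the raw points; a sketch using the x vector from the question (with the default align = "center", the k = 5 mean lines up with positions 3 through 58):

dat <- x                        # the 60-point vector from the question
dat.k5 <- rollmean(dat, k = 5)
plot(dat, xlab = "index", ylab = "value")
lines(seq_along(dat.k5) + 2, dat.k5, col = "red", lwd = 2)  # centred rolling mean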
One reasonable possibility:
library(ggplot2)
# read the series from a file; the x vector from the question would work too
d <- data.frame(x = scan("tmp.dat"))
qplot(x = seq(nrow(d)), y = x, data = d) + geom_smooth(method = "loess")
With regard to "is this a good way to identify patterns" (which is a little off-topic for StackOverflow, but whatever): I think rolling means are perfectly respectable, although more sophisticated methods, such as the locally weighted regression (loess/lowess) shown here, do exist. However, it doesn't look to me as though there is much of a complicated pattern to detect here: the data seem to decline initially with time, then level off. Rolling means and more sophisticated approaches may look prettier, but I don't think they will identify any deeper patterns in this data set ...
If you want to do this sort of thing for multiple data sets at once (as indicated in your comment), you may like ggplot2's capabilities for automatically producing multi-line or faceted versions of the same plot, for example:
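A sketch with two toy series stacked in long format (the second series is just the first reversed, for illustration), faceted with facet_wrap():

library(ggplot2)
# 'series' labels which data set each row belongs to
df <- data.frame(t      = rep(1:60, 2),
                 value  = c(x, rev(x)),
                 series = rep(c("series 1", "series 2"), each = 60))
ggplot(df, aes(t, value)) +
  geom_point() +
  geom_smooth(method = "loess") +
  facet_wrap(~ series)  # one panel per series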