First time poster here, so please forgive any faux pas on my part.
I have a set of data which consists of essentially 3 fields:
1) Position
2) Start_of_shift (datetime object)
3) End_of_Shift (datetime object)
From the datetime objects I can extract date, day of week, and time. The schedules run 24/7 and do not conform to any standard three-shift rotation; they are fairly site-specific. (I am using the lubridate package.)
I would like to visualize time of day vs. day of week to show staff counts, so that I can see where coverage is heavy and where it is light on specific days and at specific times.
I am unsure how to approach this problem, as I am relatively new to R and have found the various date-time packages and base utilities confusing and often in conflict with each other. While I find plenty of examples of time series plotting, I have found next to nothing on how to plot data where the start and end times are in separate fields and you want to show areas of overlap.
I was thinking of using ggplot2 with geom_tile, perhaps with a smoother, but wanted to know whether there are good examples out there that do something similar, or whether anyone has an idea of how I should transform my data to best achieve my objective. I would like to keep time continuous, but as a last resort I will discretize it into 15-minute chunks if necessary. Are there other options?
Any thoughts?
You might consider using a Gantt chart; the gantt.chart function in the plotrix package is one option for creating them.
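As a minimal sketch of that idea (assuming a data frame called shifts with columns Position, Start_of_shift and End_of_Shift as POSIXct; the names and the constant priorities are assumptions, not from your post):

library(plotrix)
# gantt.chart() expects a list with labels, starts, ends and priorities
gantt.info <- list(labels     = shifts$Position,
                   starts     = shifts$Start_of_shift,
                   ends       = shifts$End_of_Shift,
                   priorities = rep(1, nrow(shifts)))
gantt.chart(gantt.info, main = "Staff shifts")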
Maybe the timeline package is what you need. I've found it very good for planning projects. It's on CRAN, but you can see a quick example at its GitHub home here.
To work out how many people are present (or should be, if it's a future event), you need to think of your staffing as a stock/flow problem.
The first step would be to use the melt() function from the reshape2 package to get all the dates into one column and the event type (starting/finishing) into another.
From this you can create a running total of how many people will be in at any time.
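A minimal sketch of that idea (assuming a data frame called shifts with the columns Position, Start_of_shift and End_of_Shift from the question):

library(reshape2)
library(dplyr)
# One row per event: each shift contributes a "start" row and an "end" row
events <- melt(shifts,
               id.vars = "Position",
               measure.vars = c("Start_of_shift", "End_of_Shift"),
               variable.name = "event",
               value.name = "time")
# +1 when someone starts, -1 when someone finishes, then a running total
staffing <- events %>%
  mutate(change = ifelse(event == "Start_of_shift", 1L, -1L)) %>%
  arrange(time) %>%
  mutate(on_duty = cumsum(change))
# on_duty is the head count between consecutive event times; it can then be
# plotted against time of day and day of week (e.g. with ggplot2)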
I am relatively new to using predictive modeling and would like some brainstorming help/assessment of feasibility.
I currently have the following variables in the data set for 2018 to present, with one row per order:
date
day of week
item category
order id
lat / long for shipping address.
I would like to predict weekly sales for the remaining weeks of this year BY item category. I am most comfortable using R at the moment.
What algorithm/package would you recommend I look into given that I would like to predict weekly sales volume by category?
The short answer is to start with the tidyverse packages. group_by() from dplyr is very powerful for computing values by some grouping factor. It sounds like your data is already in a tidy form, which works best with the tidyverse framework, as it lets you easily vectorize operations over a data.frame. Check out the main packages they offer and their overviews here. Start with simpler models like lm(), and then, if the need arises, continue with more advanced ones. Which of the variables are you going to use as predictors?
No matter which model you choose, once you have built an appropriate one you can use the built-in predict() together with group_by(). More details on basic prediction here.
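For instance, a minimal sketch under the assumption that your orders live in a data frame called orders with columns date and item_category (placeholder names; adjust to your data):

library(dplyr)
library(lubridate)
# Aggregate to one row per category per week
weekly <- orders %>%
  mutate(week = floor_date(as.Date(date), unit = "week")) %>%
  group_by(item_category, week) %>%
  summarise(sales = n(), .groups = "drop")
# Fit a simple linear trend per category as a baseline
fits <- weekly %>%
  group_split(item_category) %>%
  lapply(function(d) lm(sales ~ as.numeric(week), data = d))
# Predict, say, the next 12 weeks for the first category
future_weeks <- data.frame(week = seq(max(weekly$week) + 7, by = "week", length.out = 12))
predict(fits[[1]], newdata = future_weeks)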
By the way, I can't see the data set you mention, only its description. Could you provide a link to a representative sample? That would allow me to give deeper insight.
I have a physical time series covering about two years of sample data at a 30-minute frequency, but there are multiple wide intervals of missing data, as you can see here:
I tried the na.interp function from the forecast package, with a poor result (shown above):
library(forecast)
# column-wise interpolation; the plain vectors carry no seasonality information
sapply(dataframeTS[2:10], na.interp)
I'm looking for a more useful method.
UPDATE:
Here is more info about the pattern I want to capture, specifically the raw data. This subsample is from May.
You might want to try the **imputeTS** package. It's an R package dedicated to time series missing value imputation.
The na_seadec(), na_seasplit() and na_kalman() methods might be interesting here.
There are many more algorithm options; you can find a list in this paper about the package.
In this specific case I would try:
na_seasplit(yourData)
or
na_kalman(yourData)
or
na_seadec(yourData)
Be aware that you may need to supply the seasonality information correctly with the time series (you have to create a ts object and set the frequency parameter).
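For example, with 30-minute data a daily cycle has 48 observations, so a minimal sketch (the column name value is an assumption) could look like:

library(imputeTS)
# 48 half-hour observations per day -> frequency = 48 encodes a daily pattern
# (use 48 * 7 = 336 if the weekly pattern is the relevant one)
x <- ts(dataframeTS$value, frequency = 48)
x_filled <- na_seadec(x)   # or na_seasplit(x) / na_kalman(x)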
It still might not work out at all; you will have to try.
(if you can provide the data I'll also give it a try)
I am working on a machine learning project (I am quite new to this) and I have gotten a little stuck on what to do next.
I have two somewhat small datasets: one has timestamps for when the output happened, and the other is similar but has the input timestamps. They are in the format year/month/day/hour/minute/second.
I have tried quite a bit of feature engineering, splitting these columns and looking at the differences between the nearest inputs and the nearest outputs, to better understand the time lags and the downtime. I have done a lot of visualizations to see where I can go from here, and now I am quite stuck. There aren't any obvious patterns that I can see.
I do not need to do time series forecasting, and am now trying to do anomaly detection on what I have.
My issue is that I have no idea what to do next; maybe you have some advice on which algorithms I could apply?
I am also stuck on whether I can connect each input to its output timestamp; are there any obvious methods usually applied to do that?
I mainly want to see patterns and deviations in the data; I have also tried looking at the scrap data that is generated. I do not really know which models/experiments would be good to apply and try out in my case.
Are there any data mining methods you could advise me to use?
It sounds like you are on the right track!
Here are some ideas to consider:
Is there a trend by day of week? Are weekends peak or not?
Does the hour of the day combined with day of week make a difference?
Have you looked at volume in combination with other variables? A spike in traffic on Wednesday night at 2am could be a red flag.
Basically, I'd try to encode seasonality, hour, day of week, month, year, etc. into your data.
Link: How to use machine learning for anomaly detection and condition monitoring.
Mahalanobis distance
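As a very rough sketch of the Mahalanobis-distance idea (assuming a data frame events with a POSIXct column timestamp and a numeric column value; both names are assumptions):

library(lubridate)
# Engineer simple calendar features from the timestamps
feats <- data.frame(hour  = hour(events$timestamp),
                    wday  = wday(events$timestamp),
                    month = month(events$timestamp),
                    value = events$value)
# Squared Mahalanobis distance of each row from the centre of the data
d2 <- mahalanobis(feats, center = colMeans(feats), cov = cov(feats))
# Flag rows beyond a chi-squared cutoff as candidate anomalies
threshold <- qchisq(0.99, df = ncol(feats))
anomalies <- events[d2 > threshold, ]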
At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a data frame against every other column, as in plot(data.frame).
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact, you can't see anything in them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset I put in, because then I might as well do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
Try the ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
data(tips, package = "reshape")
ggpairs(tips)
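For the original question (seeing how the issue-age distribution shifts by issue year), something along these lines might work, using the column names from the question (iss.year, issue.age, duration) and a hypothetical data frame called policies:

library(GGally)
# Colour by issue year so distribution shifts across years become visible
ggpairs(policies,
        columns = c("issue.age", "duration"),
        mapping = ggplot2::aes(colour = factor(iss.year)))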
I have a problem with clustering time series in R.
I googled a lot and found nothing that fits my problem.
I have performed an STL decomposition of the time series.
The trend component is in a matrix with 64 columns, one for every series.
Now I want to cluster these series into similar groups, taking into account both the curve shapes and the time shift. I found some functions that address one of these aspects but not both.
First I tried to calculate a distance matrix with the DTW distance; that gave me clusters based on the values and accounted for the time shift, but not for the shape of the time series. After this I tried some correlation-based clustering, but then the time shift was not recognized and the result did not satisfy my requirements.
Is there a function that could cover my problem, or do I have to build something on my own? I'm thankful for any kind of help; after two days of tutorials and examples I am totally uninspired. I hope I have explained the problem well enough.
I attached a picture showing some example time series.
There you can see the problem: the two series in the middle are assigned to one cluster, although the upper one and the one at the bottom have the same shape as one of the middle ones.
Have you tried the R package dtwclust?
https://cran.r-project.org/web/packages/dtwclust/index.html
(I'm just starting to explore this package, but it seems like it covers a lot of aspects of time series clustering and it has lots of good references.)
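As a rough sketch (assuming your trend components sit in a 64-column matrix called trend_mat, one series per column, and an arbitrary choice of k = 4):

library(dtwclust)
# dtwclust wants one series per list element (or per row of a matrix)
series <- lapply(seq_len(ncol(trend_mat)), function(i) trend_mat[, i])
# DTW-based partitional clustering: tolerant of time shifts
dtw_clust <- tsclust(series, type = "partitional", k = 4L,
                     distance = "dtw_basic", centroid = "pam")
# k-Shape: a shape-based, shift-invariant alternative (needs z-normalised series)
shape_clust <- tsclust(series, type = "partitional", k = 4L,
                       preproc = zscore, distance = "sbd", centroid = "shape")
plot(dtw_clust)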
You can use the kml package. It is designed specifically for longitudinal data. You can consult its help, which has the following example:
### Generation of some data
cld1 <- generateArtificialLongData(25)
### We suspect 3, 4 or 6 clusters, we want 3 redrawing.
### We want to "see" what happen (so printCal and printTraj are TRUE)
kml(cld1,c(3,4,6),3,toPlot='both')
### 4 seems to be the best. We want 7 more redrawing.
### We don't want to see again, we want to get the result as fast as possible.
kml(cld1,4,10)
Example cluster