How to cluster according to hour of specific day - r

I have the logs of the amount of arrivals at a bank , every half an hour for one month.
I am trying to find different cluster groups according to the amount of "arrivals". I tried according to the day, and i tried according to the hour (not of a specific day). I would like to try according to the hour of a specific day.
An example:
Thursdays at 14:00 and Sundays at 15:00 are one cluster with an average of 10000 arrivals
Mondays at 13:00, Mondays at 10:00 and Tuesdays at 16:00 are one cluster with an average of 15000 arrivals.
all the rest are another cluster with an average of 2000 arrivals.
I have a csv file with the columns: Date, Day(1-7), Time, Arrivals
Until now I used this:
km <- kmeans(table, 3, 15)
plot(km)
(i tried 3 clusters) - this code clusters pairs .( a matrix of 3x3 with a plot of each 2 out of 3 columns)
Is there a way to do that?

k-means and similar algorithms will yield meaningless results on this kind of data.
The problem is you are using the wrong tool for the wrong problem on the wrong data.
Your data is: Date, Day(1-7), Time, Arrivals
K-means will try to minimize variance. But does variance make any sense on this data set? How do you know hich k makes most sense? Since Arrivals likely has the largest variance of these attributes, it will completely dominate your result.
The question you should first try to answer is: what is a good result? Then, consider ways of visualizing the results to verify that you are up to something. And when you've visualized the data, consider ways to manually mark the desired result on the visualization, this may well be good enough for you. Better than praying for k-means to yield a somewhat meaningful result; because on this kind of mixed type data, it usually does not work very well.

Related

How to I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (So first value represents amount from 8:00 till before 8:30 etc..). This is a 1 day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency is used to define the number of (usually regularly spaced) observations within a given time period. For your example, your observations are every 30 minutes between 8AM and 8PM, and your time period is 1 day. The time period of 1 day assumes that the patterns over each day is of most interest here, you could also use 1 week here.
So within each day of your data (8AM-8PM) you have 24 observations (24 half hours). So a suitable frequency for this data would be 24.
You can also pad the data with 0 values, however this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.

T tests in R- unable to run together

I have an airline dataset from stat computing which I am trying to analyse.
There are variables DepTime and ArrDelay (Departure Time and Arrival Delay). I am trying to analyse how Arrival Delay is varying with certain chunks of departure time. My objective is to find which time chunks should a person avoid while booking their tickets to avoid arrival delay
My understanding-If a one tailed t test between arrival delays for dep time >1800 and arrival delays for dep time >1900 show a high significance, it means that one should avoid flights between 1800 and 1900. ( Please correct me if I am wrong). I want to run such tests for all departure hours.
**Totally new to programming and Data Science. Any help would be much appreciated.
Data looks like this. The highlighted columns are the ones I am analysing
Sharing an image of the data is not the same as providing the data for us to work with...
That said I went and grabbed one year of data and worked this up.
flights <- read.csv("~/Downloads/1995.csv", header=T)
flights <- flights[, c("DepTime", "ArrDelay")]
flights$Dep <- round(flights$DepTime-30, digits = -2)
head(flights, n=25)
# This tests each hour of departures against the entire day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.
pVsDay <- tapply(flights$ArrDelay, flights$Dep,
function(x) t.test(x, flights$ArrDelay, alternative = "less"))
# This tests each hour of departures against every other hour of the day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
function(x) tapply(flights$ArrDelay, flights$Dep, function (z)
t.test(x, z, alternative = "less")))
I'll let you figure out multiple hypothesis testing and the like.
All vs All

Designating dynamic start in time-series vectors in R

I have some problems with time-series designation of vectors in R.
I work with time-series and when I want to set a vector to a certain period, I feel quite confident about how to do it. I have simply done as follow name<- ts(name, frequency=12, start=c(2007,1)). As you can see I have monthly data
I am making an R template for colleagues to use, and I want them to be able to carry out a recursive ARIMA regression from any given starting point. That is, I have a range of in-sample predicted valued and I want to designate a start-value that is n monthly observation after 2007 (or whatever start data is used), where n is the start-value of the recursive regression.
first and last from the xts time-series package do exactly what you want.i.e. to get the first 2 months of an object x:
first(x, '2 months’)
or the last 6 weeks:
last(x, '6 weeks’)
Valid period.types are: secs, seconds, mins, minutes, hours, days, weeks, months, quarters, and years. As always you can find much more detailed information using ?xts::first.

Statistical analysis on daily data

I have a number of data points that I am trying to extract a meaningful pattern from (or derive an equation that could then be predictive). I am trying to find a correlation (?) between RANK and DAILY SALES for any given ITEM.
So, for any given item, I have (say) two weeks of daily information, each day consists of a pairing of Inventory, and Rank.
ITEM #1
Monday: 20 in stock (rank 30)
Tuesday: 17 in stock (rank 29)
Wednesday: 14 in stock (rank 31)
The presumption is that 3 items were sold each day, and that selling ~3 a day is roughly what it means to have a rank of ~30.
Given information like this across a wide span (20,000 items, over 2 weeks) of inventory/rank/date pairings, I'd like to derive an equation/method of estimating what the daily sales would be for any given rank.
There's one problem:
The data isn't entirely clean, because -occasionally- the inventory fluctuates upward, either because of re-stocking, or because of returns. So for example, you might see something like
MONDAY: 30 in stock.
TUESDAY: 20 in stock.
WEDNESDAY: 50 in stock.
THURSDAY: 40 in stock.
FRIDAY: 41 in stock.
Indicating that, between Tuesday and wednesday, 30 more were replenished, and on thursday, one was returned.
I am planning to use mean and standard deviation on Daily sales for given rank.
So if any rank given I can predict the daily sales based on mean and standard deviation values.
Is this correct approach? IS there any better approach for this scenario
Sounds like this could be a good read for you, fpp
It provides an introduction to timeseries forecasting. Timeseries forecasting
has a lot of nuance so it can trip people up pretty easily. Some of the issues
you have already noted (e.g. seasonality). Others pertain to the statistical
properties of such series of data. Take a look through this and

Role of frequency parameter in ts

How does the ts() function use its frequency parameter? What is the effect of assigning wrong values as frequency?
I am trying to use 1.5 years of website usage data to build a time series model so that I can forecast the usage for coming periods. I am using data at daily level. What should be the frequency here - 7 or 365 or 365.25?
The frequency is "the" period at which seasonal cycles repeat. I use "the" in scare quotes since, of course, there are often multiple cycles in time series data. For instance, daily data often exhibit weekly patterns (a frequency of 7) and yearly patterns (a frequency of 365 or 365.25 - the difference often does not matter).
In your case, I would assume that weekly patterns dominate, so I would assign frequency=7. If your data exhibits additional patterns, e.g., holiday effects, you can use specialized methods accounting for multiple seasonalities, or work with dummy coding and a regression-based framework.
Here, the frequency parameter is not a frequency that you can observe in the data of your time series. Instead, you have to specify the frequency at which samples of the time series were taken. In your case, this is simply 1 day, or 1.
The value you give here will influence the results you get later when running analysis operations (examples are average requests per time unit or fourier transformation to get the (real) frequencies in the data). E.g. if you wanted to get all your results in the unit of hours instead of in days, you would pass 24 instead of 1 as frequency, because your data samples were taken in a frequency of 24 hours.

Resources