Statistical analysis on daily data - math

I have a number of data points that I am trying to extract a meaningful pattern from (or derive an equation that could then be predictive). I am trying to find a correlation (?) between RANK and DAILY SALES for any given ITEM.
So, for any given item, I have (say) two weeks of daily information, each day consisting of a pairing of inventory and rank.
ITEM #1
Monday: 20 in stock (rank 30)
Tuesday: 17 in stock (rank 29)
Wednesday: 14 in stock (rank 31)
The presumption is that 3 items were sold each day, and that selling ~3 a day is roughly what it means to have a rank of ~30.
Given information like this across a wide span (20,000 items, over 2 weeks) of inventory/rank/date pairings, I'd like to derive an equation/method of estimating what the daily sales would be for any given rank.
There's one problem:
The data isn't entirely clean, because -occasionally- the inventory fluctuates upward, either because of re-stocking, or because of returns. So for example, you might see something like
Monday: 30 in stock.
Tuesday: 20 in stock.
Wednesday: 50 in stock.
Thursday: 40 in stock.
Friday: 41 in stock.
This indicates that, between Tuesday and Wednesday, 30 more were replenished, and on Thursday one was returned.
I am planning to compute the mean and standard deviation of daily sales for each rank, so that for any given rank I can predict the daily sales from those mean and standard deviation values.
Is this the correct approach? Is there a better approach for this scenario?
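A minimal base-R sketch of that approach, on made-up data (all column names and values hypothetical): infer daily sales as the day-over-day drop in inventory, discard days where the drop is negative (restocks or returns), then summarise sales by rank.

```r
set.seed(1)
# Hypothetical inventory/rank observations: 3 items, 5 days each
df <- data.frame(
  item = rep(1:3, each = 5),
  day  = rep(1:5, times = 3),
  inventory = c(20, 17, 14, 11, 9,   30, 20, 50, 40, 41,   12, 10, 9, 7, 6),
  rank = sample(28:32, 15, replace = TRUE)
)
df <- df[order(df$item, df$day), ]

# Daily sales = yesterday's inventory minus today's, per item
df$sales <- ave(df$inventory, df$item, FUN = function(x) c(NA, -diff(x)))

# Drop the first day of each item and any day where inventory rose (restock/return)
clean <- subset(df, !is.na(sales) & sales >= 0)

# Mean and standard deviation of daily sales for each observed rank
agg <- aggregate(sales ~ rank, data = clean,
                 FUN = function(s) c(mean = mean(s), sd = sd(s)))
```

Filtering out the negative "sales" days loses the sales that happened on restock days, so this slightly underestimates; a refinement would be to impute those days from the item's typical rate rather than drop them.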

Sounds like fpp (Forecasting: Principles and Practice) could be a good read for you.
It provides an introduction to time series forecasting. Time series forecasting
has a lot of nuance, so it can trip people up pretty easily. Some of the issues
you have already noted (e.g. seasonality); others pertain to the statistical
properties of such series of data. Take a look through it.

Related

Analyzing disparate time series in R

Are there tools in R that simplify analysis of lagged and disparate time series? For example:
Daily values that only occur on weekdays (no entry on weekends or holidays)
vs
Bi-annual values
What I'm seeking is ways to:
Complete the missing daily values (with interpolated, or last value rolled forward, etc.)
Look for correlation between daily values and the bi-annual value (only the values that came before the bi-annual event)
As an example:
10-year treasury note interest rate (daily on non-holiday weekdays) as "X" and i-bond fixed rate as "Y" (set May 1/Nov 1)
Any suggestions appreciated.
I've built a test dataset manually for "x" and used functions in zoo to populate the missing values (interpolated), but I'm hoping for a less "brute-force" method for analyzing the disparate time series. I've used lag functions in the past, but those were on matching-interval time series.
What Jon commented is what I had in mind:
expand a weekday time series to full week using missing value function(s) in zoo
Sample the daily value - say April 15 for the May 1 value, October 15 for the November 1 value
Ideally be able to automate - say, loop through April 1-30 and October 1-30 to look for the highest R² for the model of choice (linear, polynomial, etc.)
Not have to build discrete datasets for each of the above - but if that is what is required I can do it programmatically; I've done that with stock data in the past. I was looking for a more efficient means of selecting the datasets ad hoc during the analysis.
I don't have code to post, because I'm clueless as to the feature/function that would make the date selection I'm after possible (at least in R).
Thanks for the input so far. It has already been useful in helping me look at alternative methods to achieve what I'm after.
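The "expand to a full calendar, then roll the last value forward" step can be sketched in base R alone (zoo's na.locf/na.approx do the same fill more conveniently); dates and values here are made up for illustration:

```r
# Full calendar for two weeks starting on a Monday
all_days <- seq(as.Date("2024-01-01"), as.Date("2024-01-14"), by = "day")

# Hypothetical weekday-only series: value i on the i-th weekday
is_weekday <- !format(all_days, "%u") %in% c("6", "7")  # %u: Mon=1 .. Sun=7
x <- data.frame(date = all_days[is_weekday], value = seq_len(sum(is_weekday)))

# Expand to the full calendar; weekends appear as NA
full <- merge(data.frame(date = all_days), x, all.x = TRUE)

# Last observation carried forward (what zoo::na.locf does)
for (i in seq_len(nrow(full))[-1])
  if (is.na(full$value[i])) full$value[i] <- full$value[i - 1]

# Sampling the filled series on a chosen lead date is then a plain lookup
full$value[full$date == as.Date("2024-01-07")]
```

With the series filled, the April-15-for-May-1 style sampling reduces to subsetting on a vector of lookup dates, which is easy to loop over when scanning for the best lead.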

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (So first value represents amount from 8:00 till before 8:30 etc..). This is a 1 day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency is used to define the number of (usually regularly spaced) observations within a given time period. For your example, your observations are every 30 minutes between 8AM and 8PM, and your time period is 1 day. The time period of 1 day assumes that the patterns over each day are of most interest here; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half hours). So a suitable frequency for this data would be 24.
You can also pad the data with 0 values, however this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.
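Under those assumptions the ts() call is a one-liner; here using simulated values in place of the real menge column:

```r
# 3 days of half-hourly readings from 8:00 to 19:30 -> 24 observations per day
menge <- runif(3 * 24)
y <- ts(menge, frequency = 24)
frequency(y)  # 24 observations per seasonal cycle (one 8AM-8PM day)
```

Croston's method (e.g. forecast::croston) can then be applied to y directly; no zero-padding of the overnight hours is needed.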

Sort stock price volatility data into deciles on a monthly rolling basis and calculate the return on outcoming portfolios

As the title suggests, I am doing some research about low-volatility stocks for my Bachelor thesis.
I compiled the stock price quotes of German listed companies for 15 years, and my goal is to build deciles of the stocks based on the volatility of stock prices in the prior month. This should happen on a rolling basis, i.e. every month new deciles. The deciles represent the portfolios, and the code should also be able to report the return of the different deciles over the 15-year period. Weighting of the single stocks would be equal to begin with.
My professor suggests to use R for the monthly rebalancing of the portfolios and everything else connected with the quantitative part of the thesis.
Now comes the problem. Unfortunately, I have absolutely no experience in coding, and even though I watched some tutorials and am able to do some basic stuff in R, developing the code necessary for my problem is by far beyond my knowledge.
I really appreciate any help I can get on my problem and would be massively thankful for every hint.
Kind regards
Edit:
To have a more precise explanation of the problem, I will try to illustrate the problem below:
We have 100 stocks right now.
Month 1:
1.Decile: 10 Stocks with highest volatility in prior month grouped in one portfolio.
.
.
.
10.Decile: 10 Stocks with lowest volatility in prior month grouped in one portfolio.
Month 2:
1.Decile: 10 Stocks with highest volatility in prior month grouped in one portfolio.
.
.
.
10.Decile: 10 Stocks with lowest volatility in prior month grouped in one portfolio.
This sorting goes on for every month in the 15 year period. Obviously, every single stock can be in different deciles every month as it is set by its prior volatility.
Furthermore, the code should simulate investing, for example, 1 dollar in the highest-volatility portfolio, i.e. 10 cents in each stock. I then hold the stocks over the month, after which the code needs to check whether there was a change in the 10 stocks in the highest-volatility portfolio, divest the stocks that dropped out, and invest in the new ones.
In the end, the code should give out the return I would have generated by following this investment strategy.
Also, this should be done for each of the deciles to compare the results across the different volatility deciles.
Hopefully, it is a bit more clear now. If there are still problems understanding it, feel free to tell me.
Thank you very much.
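To make the monthly sorting concrete, here is a base-R sketch on simulated data (all names and numbers are hypothetical): within each month, assign each stock to a decile by its prior-month volatility, compute the equal-weighted return of each decile portfolio, and compound those monthly returns.

```r
set.seed(42)
# Hypothetical panel: 100 stocks over 3 months, each row carrying the stock's
# prior-month volatility and its return over the holding month
panel <- expand.grid(stock = 1:100, month = 1:3)
panel$vol <- runif(nrow(panel))
panel$ret <- rnorm(nrow(panel), mean = 0.01, sd = 0.05)

# Decile 1 = the 10 stocks with the highest prior-month volatility,
# decile 10 = the 10 with the lowest, re-sorted within every month
panel$decile <- ave(panel$vol, panel$month,
                    FUN = function(v) 11 - ceiling(rank(v) / (length(v) / 10)))

# Equal-weighted monthly return of each decile portfolio
port <- aggregate(ret ~ month + decile, data = panel, FUN = mean)

# Compound each decile's monthly returns over the whole sample
total <- aggregate(ret ~ decile, data = port,
                   FUN = function(r) prod(1 + r) - 1)
```

Because the decile assignment is recomputed inside every month, the monthly rebalancing (divesting drop-outs, buying new entrants at equal weight) is implicit in taking the equal-weighted mean per month-decile group; the real work is reshaping your 15 years of price quotes into this stock/month/vol/ret panel.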

Role of frequency parameter in ts

How does the ts() function use its frequency parameter? What is the effect of assigning wrong values as frequency?
I am trying to use 1.5 years of website usage data to build a time series model so that I can forecast the usage for coming periods. I am using data at daily level. What should be the frequency here - 7 or 365 or 365.25?
The frequency is "the" period at which seasonal cycles repeat. I use "the" in scare quotes since, of course, there are often multiple cycles in time series data. For instance, daily data often exhibit weekly patterns (a frequency of 7) and yearly patterns (a frequency of 365 or 365.25 - the difference often does not matter).
In your case, I would assume that weekly patterns dominate, so I would assign frequency=7. If your data exhibits additional patterns, e.g., holiday effects, you can use specialized methods accounting for multiple seasonalities, or work with dummy coding and a regression-based framework.
Note that the frequency parameter is not a sampling rate in the physical sense; it is the number of observations per seasonal period. Passing frequency = 1 for daily data treats every observation as its own period, so no seasonality is modelled at all.
The value you give will influence the results you get later when running analysis operations (for example, seasonal decomposition or seasonal forecasting methods): a model fitted with frequency = 7 will look for weekly patterns, while one fitted with frequency = 365 will look for yearly ones.
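As a quick check of what the frequency argument controls in ts(), on simulated daily data:

```r
x <- rnorm(4 * 7)              # four weeks of simulated daily observations
daily <- ts(x, frequency = 7)  # one seasonal cycle = one week
frequency(daily)   # 7
head(cycle(daily)) # 1 2 3 4 5 6 : position of each observation within its week
```

Seasonal methods such as stl() or forecast::ets() read this attribute to decide what cycle length to fit, which is why a wrong frequency silently produces a model looking for the wrong pattern.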

How to prepare my data for a Neural Network training for a natural gas ( NATGAS ) price-predictions?

I want to create a neural network (NN) to predict future prices in natural gas.
I'm not sure it's a simple time series problem:
Each month (so 12 of these) I follow a futures spread (e.g. Sep-Oct) until the front contract expires.
I start following it for approx 60 days ( data points ).
For each of the data points I have other inputs e.g. weather, inventories for the week, price of coal etc.
I have the previous 5 yrs data for each spread for each of the months of the year.
I want the NN to learn whether it can predict the direction of the current month's spread for the next x days of the 60, given that I'll know the weather, inventories, coal prices, etc. (the feature vector) at the moment of prediction.
The question I'd like answered: "given this year's inventories, weather patterns, and coal price, where will the spread go in the last 20 days of the contract?"
Is this suitable for an NN-based predictor?
If so how should I be preparing my data?
I'm using MATLAB
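The asker is working in MATLAB, but the data-preparation step is language-agnostic, so here is a base-R sketch of the shaping logic only (all names, sizes, and the simulated values are hypothetical): turn one 60-day spread history plus its exogenous inputs into supervised training rows, each row a flattened lookback window of features with a binary label for the spread's direction over the chosen horizon.

```r
# Hypothetical 60-day history for one contract-month's spread
n <- 60
dat <- data.frame(spread    = cumsum(rnorm(n)),
                  weather   = rnorm(n),
                  inventory = rnorm(n),
                  coal      = rnorm(n))

lookback <- 5    # days of history fed to the network per example
horizon  <- 20   # predict spread direction this many days ahead

rows <- lapply(seq_len(n - lookback - horizon + 1), function(i) {
  window <- dat[i:(i + lookback - 1), ]                 # feature window
  up <- dat$spread[i + lookback + horizon - 1] >        # spread 20 days out
        dat$spread[i + lookback - 1]                    # vs. spread "today"
  c(as.matrix(window), direction = as.integer(up))
})
X <- do.call(rbind, rows)  # one training example per row; last column = label
```

Repeating this over the 5 years x 12 contract-months of spreads yields a stacked training matrix, which is the shape MATLAB's (or any) feedforward-network tooling expects; normalising each feature column is the usual next step.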
