Time series with multiple stores multiple products - r

I have 300 stores and 20,000 products in my data set, and I want to forecast the next 3 months of sales for each product at each outlet. My data frame, which I pull from SQL Server 2016, looks like this (sample):
Date      outlet  produ  price
2019-Jan  A       W      10
2019-Feb  A       R      20
2019-Feb  A       W      15
2019-Jan  B       W      30
2019-Jan  B       F      40
2019-Feb  B       W      40
What I tried was to take the entire time series for a single product, fit a model to it, and get the forecast:
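library(dplyr)
library(forecast)
library(ggplot2)
## get the data set like this
outlet <- c('A','A','B','B')
produ <- c('W','R','W','F')
price <- c(10,20,30,40)
df <- data.frame(outlet, produ, price)
## tried to get a single product
dpSingle <- dplyr::filter(df, produ == 'W')
## Quntity, year and month are not defined in this sample; they are assumed to come from the full data set
data.ts <- ts(Quntity, start = c(year, month), frequency = 12)
fit_arima <- auto.arima(data.ts, d = 1, D = 1, stepwise = FALSE,
                        approximation = FALSE, trace = TRUE)
fcast <- forecast(fit_arima, h = 24)
autoplot(fcast) + ggtitle("Forecasted for next 24 months") +
  ylab("quntity") + xlab("Time in months")
## exp() is only appropriate if the series was log-transformed before fitting
print(exp(fcast$mean))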
But what I want is to loop through the data frame, identify each outlet and then each product, extract the corresponding observations with their features, pass them to my time series model, and get predictions individually for each product at each outlet.
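One possible way to get a forecast per outlet and product is to split the data frame into one piece per combination and apply the same fitting function to each piece. The sketch below is only an illustration under assumptions: the toy data, the column name quantity, and the helper forecast_one are hypothetical, and stats::arima with a fixed order stands in for forecast::auto.arima so that the example runs with base R alone.

```r
# Toy data (hypothetical): 2 outlets x 2 products, 24 monthly observations each
set.seed(1)
sales <- expand.grid(month = 1:24, outlet = c("A", "B"), produ = c("W", "R"),
                     stringsAsFactors = FALSE)
# Random-walk style quantities, generated separately for each outlet/product
sales$quantity <- 100 + ave(rnorm(nrow(sales), sd = 5),
                            sales$outlet, sales$produ, FUN = cumsum)

# One data frame per outlet/product combination that actually occurs
groups <- split(sales, list(sales$outlet, sales$produ), drop = TRUE)

# Fit a model to one series and forecast the next 3 months
forecast_one <- function(g) {
  g <- g[order(g$month), ]
  series <- ts(g$quantity, start = c(2019, 1), frequency = 12)
  fit <- arima(series, order = c(0, 1, 1))  # stand-in for forecast::auto.arima
  as.numeric(predict(fit, n.ahead = 3)$pred)
}

# Named list: one 3-month forecast per "outlet.product" key
forecasts <- lapply(groups, forecast_one)
print(forecasts)
```

With 300 outlets and 20,000 products, many combinations will probably have short or sparse histories, so in practice forecast_one would likely need a check on the series length before fitting.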

Related

In R, categorize time series data based on regex

My data is organised as follows: for each product, there is a tax rate for each year as well as a base year tax (baseyr).
product<-c("01","02","03","04")
baseyr <-c("10","8 GBP/tonne","8GBP/tonne + 8GBP/tonne","8")
yr1<-c("5","5 GBP/tonne","5GBP/tonne + 10GBP/tonne","5 + 5GBP/tonne")
yr2<-c("3","3GBP/tonne + 6GBP/tonne","3 GBP/tonne","3 + 5GBP/tonne")
yr3<-c("2","2","2GBP/tonne + 2GBP/tonne","excluded")
sched<-data.frame(product,baseyr,yr1,yr2,yr3)
For each year, I need to classify each product by tax type in a new column based on the following conditions:
#number -> only numbers in the tax
#nonnumber -> numbers and strings in the tax
#mixed -> either two strings or number and string; the two strings are specified by a plus sign
#baseyr -> if the tax is "excluded" from the list, the tax to be used should be the value in base year, and the classification based on this
So if there are 3 years I need to generate 3 tax type columns. However the number of years changes randomly per dataset so I need to code with this in mind. My code is currently something like this:
yearnum<-3 #set number of years; it is between 1 and around 10 but there is no limit
schedule<-c(paste0("yr",1:yearnum))
tax<-c(paste0(schedule,"_tax"))
for(i in 1:nrow(sched)){
#for each new tax type
for(j in tax){
#columns 3 to five where the yearly tax rates are
for(yr in 3:5){
#if the tax is excluded from the list, the base year tax should be used to determine the tax nature
if(sched[i,yr] =="excluded"){sched[i,yr] <- sched[i,baseyr]}
#if there is a plus sign it is a mixed tax
if(grepl("\\+",sched[i,yr])){sched[i,j] <- "mixed"}
#if it is not mixed but contains strings it is a nonnumber tax
if(grepl("[:alpha:]",sched[i,yr])){sched[i,j] <- "nonnumber"}
#finally if it is neither of the above it must be a number tax
if(is.na(sched[i,j])){sched[i,j] <- "number"}
}}}
NOTE: I do not know at the start how many years there will be in total; this has to be generated in the code. Any advice much appreciated, especially to avoid these for loops that don't seem to work properly for me.
The final output should be:
#so the output should be:
yr1_tax<-c("number","nonnumber","mixed","mixed")
yr2_tax<-c("number","mixed","nonnumber","mixed")
yr3_tax<-c("number","number","mixed","number")
#and the final dataframe:
sched<-data.frame(product,baseyr,yr1,yr2,yr3,yr1_tax,yr2_tax,yr3_tax)
You could use ifelse to replace every excluded value with the baseyr value, then use case_when with regular expressions, as shown below:
library(dplyr)

sched %>%
  mutate(
    across(starts_with('yr'), ~ifelse(.x == 'excluded', baseyr, .x),
           .names = '{.col}_tax'),
    across(ends_with('tax'),
           ~case_when(grepl("^\\d+$", .x) ~ 'number',
                      grepl('^[^+]+$', .x) ~ 'nonnumber',
                      grepl('[+]', .x) ~ 'mixed')))
   product  baseyr                   yr1                       yr2                      yr3                      yr1_tax    yr2_tax    yr3_tax
1  01       10                       5                         3                        2                        number     number     number
2  02       8 GBP/tonne              5 GBP/tonne               3GBP/tonne + 6GBP/tonne  2                        nonnumber  mixed      number
3  03       8GBP/tonne + 8GBP/tonne  5GBP/tonne + 10GBP/tonne  3 GBP/tonne              2GBP/tonne + 2GBP/tonne  mixed      nonnumber  mixed
4  04       8                        5 + 5GBP/tonne            3 + 5GBP/tonne           excluded                 mixed      mixed      number
The regex is simplified: a value made up of digits only is number; otherwise, a value without a + is nonnumber, and a value containing a + is mixed.
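The same classification can also be sketched in base R without knowing the number of years in advance, by selecting the year columns by name pattern. This is only an illustration: the helper classify_tax is a hypothetical name, and the sketch assumes the year columns are always called yr1, yr2, and so on.

```r
product <- c("01", "02", "03", "04")
baseyr <- c("10", "8 GBP/tonne", "8GBP/tonne + 8GBP/tonne", "8")
yr1 <- c("5", "5 GBP/tonne", "5GBP/tonne + 10GBP/tonne", "5 + 5GBP/tonne")
yr2 <- c("3", "3GBP/tonne + 6GBP/tonne", "3 GBP/tonne", "3 + 5GBP/tonne")
yr3 <- c("2", "2", "2GBP/tonne + 2GBP/tonne", "excluded")
sched <- data.frame(product, baseyr, yr1, yr2, yr3, stringsAsFactors = FALSE)

# Classify one year's tax column; "excluded" falls back to the base-year tax
classify_tax <- function(tax, base) {
  tax <- ifelse(tax == "excluded", base, tax)
  ifelse(grepl("+", tax, fixed = TRUE), "mixed",
         ifelse(grepl("^[0-9]+$", tax), "number", "nonnumber"))
}

# Find every yrN column, however many there are, and add a matching yrN_tax column
year_cols <- grep("^yr[0-9]+$", names(sched), value = TRUE)
sched[paste0(year_cols, "_tax")] <- lapply(sched[year_cols], classify_tax,
                                           base = sched$baseyr)
print(sched[paste0(year_cols, "_tax")])
```

Because the columns are found with grep rather than hard-coded positions, the same code should work unchanged for a schedule with one year or ten.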

Backtesting in R for time series

I am new to backtesting, the methodology for assessing whether something works based on historical data. Since I am new to it, I am trying to keep things simple in order to understand it. So far I have understood the following. Let's say I have a time series data set:
date = seq(as.Date("2000/1/1"),as.Date("2001/1/31"), by = "day")
n = length(date);n
class(date)
y = rnorm(n)
data = data.frame(date,y)
I will keep the first year (366 days, since 2000 is a leap year) as the in-sample period in order to do something with it, and then I will update it with one observation at a time over the following month. Am I correct here?
If so, I define the in-sample and out-of-sample periods:
T = dim(data)[1];T
outofsampleperiod = 31
initialsample = T-outofsampleperiod
I want for example to find the quantile of alpha =0.01 of the empirical data.
pre = data[1:initialsample,]
ypre = pre$y
quantile(ypre,0.01)
1%
-2.50478
Now the difficult part for me is updating the sample in a for loop in R.
I want to add one observation at a time and recompute the empirical quantile at alpha = 0.01, print them all, and check whether each one is greater than the in-sample quantile obtained above.
for (i in 1:outofsampleperiod){
qnew = quantile(1:(initialsample+i-1),0.01)
print(qnew)
}
You can create a little function that gets the quantile of column y, over rows 1 to i of a frame df like this:
func <- function(i,df) quantile(df[1:i,"y"],.01)
Then apply this function to each row of data (sapply returns a numeric vector, whereas lapply would store a list column):
data$qnew = sapply(1:nrow(data), func, df=data)
Output (last six rows)
> tail(data)
          date          y      qnew
392 2001-01-26  1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29  1.0881961 -2.252701
396 2001-01-30  0.1754646 -2.252383
397 2001-01-31  0.5929567 -2.252065
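The expanding-window update described in the question can also be written directly over the out-of-sample period only. The following is a minimal sketch, assuming that "adding one observation at a time" means recomputing the 1% quantile on rows 1 to initial sample + i; the names q_expanding and exceeds are hypothetical, and the seed is set only to make the toy data reproducible.

```r
set.seed(1)
date <- seq(as.Date("2000/1/1"), as.Date("2001/1/31"), by = "day")
data <- data.frame(date, y = rnorm(length(date)))

n_total <- nrow(data)
out_of_sample <- 31
initial_sample <- n_total - out_of_sample

# Quantile of the in-sample period, used as the benchmark
q_in_sample <- quantile(data$y[1:initial_sample], 0.01, names = FALSE)

# Recompute the quantile after adding each out-of-sample observation
q_expanding <- vapply(seq_len(out_of_sample), function(i) {
  quantile(data$y[1:(initial_sample + i)], 0.01, names = FALSE)
}, numeric(1))

# TRUE where the updated quantile is greater than the in-sample one
exceeds <- q_expanding > q_in_sample
print(data.frame(date = data$date[initial_sample + seq_len(out_of_sample)],
                 q_expanding, exceeds))
```

Note that the window must index the observations themselves (data$y), not the row numbers, otherwise the quantile is taken over the integers 1, 2, 3 and so on.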

how to compound the interest monthly in recurring deposit calculation?

How do I calculate a recurring deposit on a monthly basis?
M = ( R * [(1+r)^n - 1] ) / ( 1 - (1+r)^(-1/3) )
M is Maturity value
R is deposit amount
r is rate of interest
n is number of quarters
If I take n as 4 (the number of quarters) for 1 year, it shows the yearly maturity value. Can anyone tell me how to do the monthly calculation? Thanks.
I'm not sure what the 1/3 is doing in the denominator, could you explain that? As integer division it will likely evaluate to 0 anyway.
That said, the formula for payment at the end of each payment interval is indeed
M = R * ( (1+r/p)^n-1 )/( (1+r/p) -1) = R * p/r * ( (1+r/p)^n-1 )
resulting in M = 125365.3694 for the given data;
and for payments at the start of each payment interval (month, quarter, ...)
M = R * (1+r/p)*( (1+r/p)^n-1 )/( (1+r/p) -1) = R * (p/r+1) * ( (1+r/p)^n-1 )
resulting in M = 126357.8452 for the given data.
Here p is the number of parts of the year that is used, i.e., p=4 for quarterly and p=12 for monthly, n is the number of payments, i.e., the payment schedule lasts n/p years, and then r is the nominal annual interest rate, used in r/p to give the interest rate over each part of the year.
Note that the effective interest rate (1+r/p)^p-1 depends on p, for p=1 it is r, for very large p it approaches exp(r)-1.
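The two closed-form expressions can be checked numerically. The sketch below is an illustration, not the answerer's code: the function name rd_maturity is hypothetical, and the inputs R = 10000, r = 0.095, p = 12 and n = 12 are an assumption inferred from the constants used in the day-count loop in this answer (they do reproduce the two values quoted above).

```r
# Maturity value of a recurring deposit with n payments of size R,
# nominal annual rate r, and p compounding periods per year
rd_maturity <- function(R, r, p, n, at_start = FALSE) {
  growth <- (1 + r / p)^n - 1
  if (at_start) R * (p / r + 1) * growth else R * (p / r) * growth
}

R <- 10000   # assumed deposit amount
r <- 0.095   # assumed nominal annual rate
p <- 12      # monthly payments
n <- 12      # one year of payments

m_end <- rd_maturity(R, r, p, n)
m_start <- rd_maturity(R, r, p, n, at_start = TRUE)
print(round(c(end_of_period = m_end, start_of_period = m_start), 4))

# Effective annual rate for a few values of p, compared with the limit exp(r) - 1
effective <- sapply(c(1, 4, 12, 365), function(p) (1 + r / p)^p - 1)
print(c(effective, limit = exp(r) - 1))
```

For a quarterly schedule over the same year, the call would be rd_maturity(R, r, p = 4, n = 4), with R then being the amount deposited each quarter.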
A more realistic result is obtained by taking the number of days in each month into account
days:=[31,28,31,30,31,30,31,31,30,31,30,31];
for k in [1..12] do
    sum:=0;
    for j in [1..12] do
        sum+:=1;
        sum*:=1+days[(j+k-1) mod 12 + 1]*0.095/365;
    end for;
    k, sum*10000;
end for;
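This gives the maturity value for each k. Note that with the index (j+k-1) mod 12 + 1, the first month used for k=1 is days[2], so as written k=1 corresponds to a schedule starting in February and k=12 to one starting in January: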
1 126402.9195
2 126324.3970
3 126343.1642
4 126329.4573
5 126348.2653
6 126334.5983
7 126353.4478
8 126372.4494
9 126358.9711
10 126378.0173
11 126364.5825
12 126383.6740
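For readers who want to rerun the day-count calculation, here is a hedged R translation of the loop above. It keeps the same indexing, deposit amount and rate as the original loop; the names deposit, rate and maturity are hypothetical, and a non-leap year is assumed, as in the days list.

```r
days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
deposit <- 10000
rate <- 0.095

# For each offset k, deposit one unit at the start of each of 12 months
# and accrue simple daily interest over that month's number of days
maturity <- sapply(1:12, function(k) {
  total <- 0
  for (j in 1:12) {
    total <- total + 1
    total <- total * (1 + days[(j + k - 1) %% 12 + 1] * rate / 365)
  }
  total * deposit
})

print(data.frame(k = 1:12, maturity = round(maturity, 4)))
```

The spread between the largest and smallest value comes from where February falls in the schedule: a short month late in the schedule accrues less interest on a balance that already holds most of the deposits.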

Sampling in stages in R

I am running some sampling simulations from census data and I would like to sample in 2 stages.
First I want to sample 25 households within each village.
Second I want to sample 1 person from each household.
My data is in long format, with a village identifier, a household identifier, and a binary disease status (0 = healthy, 1 = diseased). The following code runs a monte-carlo simulation to sample 25 individuals per village 3000 times and record the number of malaria-positive individuals sampled.
But, I would like to sample 1 individual from 25 sampled households from each village. I can't figure it out.
Here is the link to my data:
d = read.table("data.txt", sep=",", header=TRUE)
villages = split(d$malaria, d$villageid)
positives = vector("list", 3000)
for(i in 1:3000) {
sampled = lapply(villages, sample, 25)
positives[[i]] = lapply(sampled, sum)
}
How about this?
replicate(3000, sum(sapply(lapply(villages, sample, 25), sample, 1)))
lapply(villages, sample, 25) -> gives 25 households for all 177 villages
sapply(., sample, 1) -> sample 1 person from these 25 people from each of 177 villages
sum(.) -> sum the sampled values
replicate -> repeat the same function 3000 times
I figured out a workaround. It is quite convoluted and involves taking the data and creating another dataset. (I did this in Stata as my R capabilities are limited.) First I sort the dataset by house number and load that into R (d.people). Then I create a new dataset by collapsing the old dataset by house number, and load that into R (d.house). I do the sampling in 2 stages, first sampling 1 person from each household in the people dataset. I can then sample 25 "household sampled people" from each village after combining the houses dataset with the output from sampling 1 person from each household.
d.people = read.table("people data", sep=",", header=TRUE)
d.houses = read.table("houses data", sep=",", header=TRUE)
positives = vector("list", 3000)
for(i in 1:3000){
  # stage 1: one person per household
  houses = split(d.people$malaria, d.people$house)
  firststage = sapply(houses, sample, 1)
  secondstage = cbind(d.houses, firststage)
  # stage 2: 25 sampled households per village
  villages = split(secondstage$firststage, secondstage$village)
  sampled = lapply(villages, sample, 25)
  positives[[i]] = lapply(sampled, sum)
}
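The two stages can also be kept inside a single long-format data frame, sampling households first and then one person within each sampled household. This is a sketch under assumptions: the toy census, the column name houseid, and the helper sample_village are hypothetical, and it assumes every village has at least 25 households. It uses sample.int on row positions to avoid the behaviour of sample() when it is handed a single number.

```r
set.seed(42)
# Hypothetical toy census: 3 villages, 40 households each, 1 to 6 people per household
households <- expand.grid(house = 1:40, villageid = 1:3)
sizes <- sample(1:6, nrow(households), replace = TRUE)
d <- households[rep(seq_len(nrow(households)), sizes), ]
d$houseid <- paste(d$villageid, d$house, sep = "-")
d$malaria <- rbinom(nrow(d), 1, 0.2)

# Stage 1: sample n_houses households in a village.
# Stage 2: sample one person (row) from each sampled household.
sample_village <- function(v, n_houses = 25) {
  chosen <- sample(unique(v$houseid), n_houses)
  rows <- vapply(chosen, function(h) {
    idx <- which(v$houseid == h)
    idx[sample.int(length(idx), 1)]
  }, integer(1))
  sum(v$malaria[rows])
}

by_village <- split(d, d$villageid)

# Each column is one simulation run; each row is a village
positives <- replicate(200, vapply(by_village, sample_village, numeric(1)))
print(rowMeans(positives))
```

The number of runs is kept at 200 here only to keep the example quick; the 3000 runs from the question would be a matter of changing that one argument.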

Select single column from array returned by GetQuote extension

Ted Schlossmacher's free GetQuote extension for OpenOffice.org Calc allows users to access quotes for several types of symbols tracked by Yahoo! Finance. Specifically, the GETHISTORY() function returns a range of past and present quotes.
After installing the extension, try highlighting a 5-column range and then typing =GETHISTORY("PETR4.SA",1,TODAY()-1) (you might need to use semicolons instead of commas) and then pressing Ctrl+Shift+Return. That should provide you with date, open, high, low and close quotes for PETR4, the preferred stock of Brazilian oil giant Petrobras S.A.
My question is: how can I, in one cell, insert a formula that would return me the value of the 5th column of the above array?
This can be done with the INDEX function. You don't need to use Ctrl+Shift+Enter for it to work, as it doesn't return an array.
=INDEX(GETHISTORY("PETR4.SA",1,TODAY()-1),1,5)
The 2 end parameters are row,column, and are a 1-based index into the array.
More information about INDEX can be found on any Excel website, or in the LibreOffice Calc help at https://help.libreoffice.org/Calc/Spreadsheet_Functions#INDEX
Yesterday's closing price can be retrieved using a second argument, for example:
=GETQUOTE("TD.TO",21)
From the manual:
GETQUOTE can fetch 31 types of quotes. The types are numbered from 0 to 30. The function accepts these numbers as the second argument.
0 = Last traded price
1 = Change in price for the day
2 = Opening price of the day
3 = High price of the day
4 = Low price of the day
5 = Volume
6 = Average Daily Volume
7 = Ask Price
8 = Bid Price
9 = Book Value
10 = Dividend/Share
11 = Earnings/Share
12 = Earnings/Share Estimate Current Year
13 = Earnings/Share Estimate Next Year
14 = Earnings/Share Estimate Next Quarter
15 = 52-week low
16 = Change from 52-week low
17 = 52-week high
18 = Change from 52-week high
19 = 50-day Moving Average
20 = 200-day Moving Average
21 = Previous Close
22 = Price/Earning Ratio
23 = Dividend Yield
24 = Price/Sales
25 = Price/Book
26 = PEG Ratio
27 = Price/EPS Estimate Current Year
28 = Price/EPS Estimate Next Year
29 = Short Ratio
30 = 1-year Target Price
If you need only the latest price (which is the fifth field) I believe you can simply use:
=GETQUOTE("PETR4.SA")
I'm not certain this works to return the current price when markets are open, but it does return the last trade price when markets are closed.
