I am very new to time series forecasting using Prophet in R. I am able to predict values for a single product using Prophet. Is there a way to use a loop to generate forecasts with Prophet for multiple products? The code below works fine for a single product, but I am trying to generate forecasts for multiple products:
library(prophet)
df <- read.csv("Prophet.csv")
df$Date<-as.Date(as.character(df$Date), format = "%d-%m-%Y")
colnames(df) <- c("ds", "y")
m <- prophet(df)
future <- make_future_dataframe(m, periods = 40)
tail(future)
forecast <- predict(m, future)
write.csv(forecast[c('ds','yhat')],"Output_Prophet.csv")
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
This can be done by using lists and map functions from the purrr package.
Let's build some data:
library(tidyverse) # contains also the purrr package
set.seed(123)
tb1 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
y = sample(365)
)
tb2 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
y = sample(365)
)
ts_list <- list(tb1, tb2) # two separate time series
# using this construct you could add more of course
Build and prediction:
library(prophet)
m_list <- map(ts_list, prophet) # prophet call
future_list <- map(m_list, make_future_dataframe, periods = 40) # makes future obs
forecast_list <- map2(m_list, future_list, predict) # map2 because we have two inputs
# we can access everything we need like with any list object
head(forecast_list[[1]]$yhat) # forecasts for time series 1
[1] 179.5214 198.2375 182.7478 173.5096 163.1173 214.7773
head(forecast_list[[2]]$yhat) # forecast for time series 2
[1] 172.5096 155.8796 184.4423 133.0349 169.7688 135.2990
Update (only the input part changes; the build and prediction part is the same):
I created a new example based on the OP's request. Basically, you need to put everything in a list object again:
# suppose you have a data frame like this:
set.seed(123)
tb1 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
productA = sample(365),
productB = sample(365)
)
head(tb1)
# A tibble: 6 x 3
ds productA productB
<date> <int> <int>
1 2018-01-01 105 287
2 2018-01-02 287 71
3 2018-01-03 149 7
4 2018-01-04 320 148
5 2018-01-05 340 175
6 2018-01-06 17 152
# with some tidyr and base R you can transform each time series into a data frame within a list
ts_list <- tb1 %>%
gather("type", "y", -ds) %>%
split(.$type)
# this just removes the type column that we don't need anymore
ts_list <- lapply(ts_list, function(x) { x["type"] <- NULL; x })
# now you can continue just like above..
I clean revenue data every quarter, and I need to use a two-quarter moving average to predict the next five years of quarterly revenue for each individual product (I know this will just end up being the same average for now). Attached here is the data frame: Revenue Df
Right now I have the data in wide format. As you'll see, I created the empty forecast columns by having the user enter a start and end date for the forecast; the code then creates a column for every quarter in between. How can I fill these forecast columns using a moving average? I also converted the data to long format and still could not figure out how to fill the forecast. Also, I know that 9-30-2020 shows up in the forecast; we want to replace that with the actuals even if the user inputs that date for the forecast.
for(i in ncol(Revenue_df)){
if(i<3)
{Revenue_df[,i]<- Revenue_df[,i]}
else{
Revenue_df[,i]<-(Revenue_df[,i-1]+Revenue_df[,i-2])/2
}
}
Product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame(Product,Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
forecast.sequence<- c("2020-09-30","2020-12-31","2021-03-31","2021-06-30","2021-09-30","2021-12-31","2022-03-31",
"2022-06-30","2022-09-30","2022-12-31","2023-03-31","2023-06-30","2023-09-30","2023-12-31","2024-03-31",
"2024-06-30","2024-09-30","2024-12-31")
forecast.sequence.amount<- paste("FC.Amount.",forecast.sequence)
revenue[,forecast.sequence.amount]<-NA
I tried this code and it did not work. Any suggestions? Also attached is the code for the sample data frame shown in the picture. Sorry for the bad format; this is my second time asking a question on here.
This seems to be a bit simple for a product forecast. You might want to look at the forecast and fable packages for forecast functions that can account for trends and seasonality. These would, however, require more than two data points. Anyway, taking your problem as given, the following code seems to do what you describe.
EDIT
I've made the forecast calculation a function to make it more straightforward to use.
library(tidyverse)
product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame( Product = product, Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
rev_frcst <- function(revenue, frcst_end, frcst_prefix) {
#
# Arguments:
# revenue = data frame with
# Product containing product name
# columns with the format "prefix.m_day_year" containing product quantities for past quarters
# frcst_end = end date for quarterly forecast
# frcst_prefix = string containing prefix for forecast
#
# convert revenue to long format
#
rev_long <- revenue %>% pivot_longer(cols = -Product, names_to = "Quarter", values_to = "Revenue") %>%
mutate(quarter_end = as.Date(str_remove(Quarter,"Revenue."), "%m_%d_%Y"))
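# number of past quarters per product (derived from the argument, not from a global variable)
num_revenue <- nrow(rev_long) / n_distinct(revenue$Product)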
#
# generate forecast dates
#
forecast.sequence <- seq( max(rev_long$quarter_end),
as.Date(frcst_end),
by = "quarter")[-1]
#
# Add forecast rows to data
#
rev_long <- rev_long %>%
bind_rows(expand_grid(Product=unique(revenue$Product), quarter_end = forecast.sequence) %>%
mutate(Quarter = paste(frcst_prefix, quarter_end)) )
#
# Define moving average function
#
mov_avg <- function(num_frcst, x) {
y <- c(x, numeric(num_frcst))
for(i in 1:num_frcst + 2) {
y[i] <- .5*(y[i-1] + y[i-2]) }
y[1:num_frcst + 2]
}
#
# Calculate forecast
#
rev_long_2 <- rev_long %>% group_by(Product) %>%
mutate(forecast = c(Revenue[1:num_revenue],
mov_avg(num_frcst =length(forecast.sequence),
x = Revenue[1:2 + num_revenue - 2]))) %>%
arrange(Product, quarter_end)
}
#
# call rev_frcst to calculate forecast
#
rev_forecast <- rev_frcst(revenue=revenue,
frcst_end = "2024-12-31",
frcst_prefix = "FC.Amount.")
which gives
Product Quarter Revenue quarter_end forecast
<chr> <chr> <dbl> <date> <dbl>
1 a Revenue.3_30_2020 50 2020-03-30 50
2 a Revenue.6_30_2020 50 2020-06-30 50
3 a Revenue.9_30_2020 25 2020-09-30 25
4 a FC.Amount. 2020-12-30 NA 2020-12-30 37.5
5 a FC.Amount. 2021-03-30 NA 2021-03-30 31.2
6 a FC.Amount. 2021-06-30 NA 2021-06-30 34.4
7 a FC.Amount. 2021-09-30 NA 2021-09-30 32.8
8 a FC.Amount. 2021-12-30 NA 2021-12-30 33.6
9 a FC.Amount. 2022-03-30 NA 2022-03-30 33.2
10 a FC.Amount. 2022-06-30 NA 2022-06-30 33.4
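To check the numbers by hand, here is a minimal base R sketch of the same two-quarter recursion applied to a single product. The function name two_quarter_forecast and its arguments are made up for this illustration and are not part of the answer's code. Each forecast is the mean of the two values before it, so the forecasts feed back into later forecasts.

```r
# Hypothetical helper: recursive two-quarter moving average.
# last_two = the two most recent actual values, n_ahead = quarters to forecast.
two_quarter_forecast <- function(last_two, n_ahead) {
  y <- c(last_two, numeric(n_ahead))
  for (i in seq_len(n_ahead) + 2) {
    # average of the two preceding values (actuals or earlier forecasts)
    y[i] <- mean(y[(i - 2):(i - 1)])
  }
  y[-(1:2)]  # drop the two actuals, keep only the forecasts
}

# Product "a": the last two actual quarters are 50 and 25
fc_a <- two_quarter_forecast(c(50, 25), n_ahead = 4)
fc_a
# 37.5000 31.2500 34.3750 32.8125
```

These values agree with the first four forecast rows for product a in the table above once rounded for printing.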
I am trying to forecast multiple items with Prophet in R using a list. I am able to get output, but the CSV export should be neater and cleaner. I want the output to have the date (ds) in one column, the product in a second column and the values (yhat) in a third column.
tb=as_tibble(data_input)
tslist <- tb %>% gather("type", "y", -ds) %>% split(.$type)
tslist <- lapply(tslist, function(x) { x["type"] <- NULL; x })
maplist <- map(tslist, prophet)
futurelist <- map(maplist, make_future_dataframe, periods = 12, freq = 'month')
forecast <- map2(maplist, futurelist, predict)
write.csv(forecast,"Output_Prophet.csv")
Any leads on how to change the above code so that it produces output in the format below?
Ds Product Yhat
01/01/2017 A 8
01/02/2017 A 9
01/01/2017 B 14
01/02/2017 B 18
My current output looks like below format:
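One possible way to get the long layout asked for is to stack the per-product forecasts and tag each row with the name of its list element. The sketch below uses base R only and a small stand-in list instead of real Prophet output, so the values and the object name long are assumptions for illustration. With real forecasts, ds would be a date-time column and each element would carry many more columns.

```r
# Stand-in for the named list returned by map2(maplist, futurelist, predict).
# Real Prophet forecasts contain many more columns than ds and yhat.
forecast <- list(
  A = data.frame(ds = as.Date(c("2017-01-01", "2017-02-01")), yhat = c(8, 9)),
  B = data.frame(ds = as.Date(c("2017-01-01", "2017-02-01")), yhat = c(14, 18))
)

# Keep ds and yhat from each element and label the rows with the list name.
long <- do.call(rbind, Map(function(fc, product) {
  data.frame(ds = fc$ds, Product = product, yhat = fc$yhat,
             stringsAsFactors = FALSE)
}, forecast, names(forecast)))
rownames(long) <- NULL

# One row per date and product: ds, Product, yhat
write.csv(long, "Output_Prophet.csv", row.names = FALSE)
```

This relies on the list being named, which is the case when it is built with split(.$type) as in the question.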
I am creating three types of variables over multiple time periods for individual customer IDs in my data. These new variables include the sum of a price vector, the mean of a price vector, and the mean difference between successive dates in a date vector.
Using data.table, I am looping through multiple time periods, subsetting the data in each period, and calculating those variables for individual customer IDs. Each of these variables are named dynamically as I loop over the time periods. As it stands, these variables are being computed correctly.
Here is where I am getting stuck: after all of these variables are computed, I would like to subset the data to include the new aggregated variables along with the most recent purchase.price and date elements for each customer.
I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's i index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.
In the second and third code blocks I give the output of the final dplyr code and then the output I would like to achieve.
This question stems from a similar, earlier problem in which we subset over fewer variables that are not created dynamically.
library(lubridate)
library(data.table)
library(dplyr)
data <- data.frame(custid = c(rep(1, 25), rep(2, 25), rep(1, 25), rep(2, 25)),
purchase.price = seq(1, 200, by=2),
date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-04-10"), by="days"))
period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-30")),
period_two = interval(as.Date("2015-02-01"), as.Date("2015-02-28")),
period_three = interval(as.Date("2015-03-01"), as.Date("2015-03-31")),
period_four = interval(as.Date("2015-04-01"), as.Date("2015-04-28")))
data <- as.data.table(data)
data <- data[order(date)]
setkey(data, custid)
time_periods <- c(1:4)
for(i in time_periods[1]:max(time_periods)){
data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.sum", sep="."):= sum(purchase.price),
by = custid]
data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.mean", sep="."):= mean(purchase.price),
by = custid]
data <- data[date %within% period_intervals[[i]],
paste("period", i, "mean.diff.date", sep="."):= mean(as.numeric(diff(purchase.price))),
by = custid]
}
data_sub <- data %>%
group_by(custid) %>%
arrange(desc(date)) %>%
filter(row_number() == 1)
Current result from dplyr subsetting (showing the first 7 columns):
custid purchase.price date period.1.price.sum period.1.price.mean period.1.mean.diff.date period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 NA NA NA NA ...
2 1 149 2015-03-16 NA NA NA NA ...
Here is what I was hoping for (showing first 7 columns):
custid purchase.price date period.1.price.sum period.1.price.mean period.1.mean.diff.date period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 625 25 2 981 ...
2 1 149 2015-03-16 275 55 2 1539 ...
Note:
In my complete dataset, I am looping over anywhere between 10-20 time periods. The number of periods to be computed over is subject to change, thus my approach to dynamically create the new variables.
We can use Map as in the previous post
nm1 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.sum")
nm2 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.mean")
nm3 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "mean.diff.date")
data[, c(rbind(nm1, nm2, nm3)) := unlist(Map(function(x,y) {
x1 <- purchase.price[x %within% y]
list(sum(x1), mean(x1), mean(as.numeric(diff(x1))))},
list(date), period_intervals), recursive = FALSE), by = custid]
data[order(custid, -date)][,.SD[1] , custid]
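The reason the original loop left NA values can be seen in a small example: a condition placed in i restricts which rows := writes to, while the same condition placed inside j only restricts what is summed, so the result is recycled over every row of the group. This is a minimal sketch with made-up data; the names dt, start, end, sum_in_i and sum_in_j are assumptions for illustration only.

```r
library(data.table)

# Toy data: two customers, four purchases each (hypothetical values)
dt <- data.table(
  custid = rep(1:2, each = 4),
  price  = 1:8,
  date   = as.Date("2015-01-01") + c(0, 10, 40, 50, 5, 15, 45, 55)
)
start <- as.Date("2015-01-01")
end   <- as.Date("2015-01-30")

# Filter in i: only rows inside the period receive the total, others stay NA
dt[date >= start & date <= end, sum_in_i := sum(price), by = custid]

# Filter in j: the total is computed from the period's rows
# but assigned to all rows of each customer
dt[, sum_in_j := sum(price[date >= start & date <= end]), by = custid]

dt
```

With the second form, taking the most recent row per customer keeps the period totals, which is what the Map-based answer above achieves for all periods at once.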
I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:
Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")
And for the dates of each audit across different stores:
Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")
Note that there will be an uneven number of dates within each output (i.e. there may not be a full week's worth of information prior to some store audits). I have previously asked a question addressing a similar problem: Creating a dataframe from an lapply function with different numbers of rows. Below is an answer from that question, which would work if I were to consider information from only one store:
library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))
##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))
But I don't know how I would get this for multiple sites.
Try this:
# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")
library(tidyr)
library(dplyr)
# Gather df1 so that df1 and df2 have the same format:
df1 = gather(df1, Store, Sales, -t)
head(df1)
t Store Sales
1 2015-12-30 Store.A 16
2 2015-12-31 Store.A 24
3 2016-01-01 Store.A 8
4 2016-01-02 Store.A 42
5 2016-01-03 Store.A 7
6 2016-01-04 Store.A 46
# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:
r <- lapply(1:nrow(df2), function(i) {
audit.t = df2[i, "t"] #time of audit
audit.s = df1[, "Store"] == df2[i, "Store"] #store audited
df = df1[audit.s, ] #data from audited store
df[, "audited"] = audit.t #add extra column with audit date
week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
week_audit = difftime(df[, "t"], audit.t) <= 0
df[week_before & week_audit, ]
})
Does this give you the proper subsets?
Also, to summarise your results:
r = do.call("rbind", r) %>%
group_by(audited, Store) %>%
summarise(sales = sum(Sales))
r
audited Store sales
<time> <chr> <int>
1 2016-01-04 Store.A 97
2 2016-02-01 Store.B 156
3 2016-02-01 Store.C 226
4 2016-03-01 Store.A 115
5 2016-03-01 Store.C 187
I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))
dss_individuals[i,]["dayOfBirth"] <- split_list[3]
dss_individuals[i,]["monthOfBirth"] <- split_list[2]
dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?
I compared the speed of substr, format, and lubridate. lubridate and format appear to be much faster than substr when the variable is stored as a Date. However, substr is fastest when the variable is stored as a character vector. The results of a single run are shown.
library(lubridate) # provides year(), month() and day() used below
x <- sample(
seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by="day"),
400000, replace = T)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%Y") # "%Y" is the four-digit year, matching substr(x, 1, 4)
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082
Not sure if this will solve your speed issues, but here is a nicer way of doing it using dplyr and lubridate. In general, when it comes to manipulating data frames, I personally recommend using either data.table or dplyr. data.table is supposed to be faster, but dplyr is more verbose, which I personally prefer as I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame( mother_id = c(1,2,3),
dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)
dat %>% mutate( year = year(dateOfBirth) ,
month = month(dateOfBirth),
day = day(dateOfBirth) )
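Or you can use the mutate_each function to avoid writing the variable name multiple times (though you get less control over the names of the output variables). Note that mutate_each and funs have since been deprecated in dplyr in favor of across, so this form may emit warnings: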
dat %>% mutate_each( funs(year , month , day) , dateOfBirth)
Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)
Use format once for each part (this assumes dateOfBirth is stored as a Date, e.g. after as.Date):
dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This approach assumes that day, month and year are always in the same position and have the same length.
The strsplit function is vectorized so using rbind.data.frame to convert your list to a dataframe works:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
Results need to be transposed in order to be used: you can do it using do.call or the t function.
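As a rough sketch of the vectorized strsplit idea, the pieces can be bound into a character matrix with one row per record and then assigned to columns in one step each, with no explicit loop. The data frame below is the three-row sample from the question. Using plain rbind rather than rbind.data.frame is a choice made here so that no transpose is needed.

```r
DF <- data.frame(
  mother_id   = 1:3,
  dateOfBirth = c("1962-09-24", "1991-02-19", "1978-11-11"),
  stringsAsFactors = FALSE
)

# strsplit returns one character vector per row; rbind stacks them
# into a matrix with columns year, month, day.
parts <- do.call(rbind, strsplit(DF$dateOfBirth, "-", fixed = TRUE))

DF$yearOfBirth  <- as.integer(parts[, 1])
DF$monthOfBirth <- as.integer(parts[, 2])
DF$dayOfBirth   <- as.integer(parts[, 3])
DF
```

Dropping as.integer keeps the leading zeros (for example "09"), which matches the output format shown in the question.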