Prophet output formatting in R

I am trying to forecast multiple items with Prophet in R, using a list. I get output, but when I export it to CSV I want it to be neater and cleaner: the date (ds) in the first column, the product in the second column, and the values (yhat) in the third column.
tb=as_tibble(data_input)
tslist <- tb %>% gather("type", "y", -ds) %>% split(.$type)
tslist <- lapply(tslist, function(x) { x["type"] <- NULL; x })
maplist <- map(tslist, prophet)
futurelist <- map(maplist, make_future_dataframe, periods = 12, freq = 'month')
forecast <- map2(maplist, futurelist, predict)
write.csv(forecast,"Output_Prophet.csv")
Any pointers on how to change the code above so that it produces output in the format below?
Ds Product Yhat
01/01/2017 A 8
01/02/2017 A 9
01/01/2017 B 14
01/02/2017 B 18
My current output looks like this:
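A minimal sketch of one way to get the three-column layout requested above, using base R only. The `forecast` list here is a hand-made stand-in for the result of `map2(maplist, futurelist, predict)`, with made-up values, and `out_file` is a hypothetical path; the idea is simply to keep `ds` and `yhat` from each element and tag every row with the list name.

```r
# Stand-in for the list returned by map2(maplist, futurelist, predict):
# one data frame per product (hypothetical values, extra columns trimmed).
forecast <- list(
  A = data.frame(ds = as.Date(c("2017-01-01", "2017-02-01")),
                 yhat = c(8, 9), yhat_lower = c(7, 8)),
  B = data.frame(ds = as.Date(c("2017-01-01", "2017-02-01")),
                 yhat = c(14, 18), yhat_lower = c(12, 16))
)

# Keep ds and yhat from each element and tag the rows with the list name.
long <- do.call(rbind, lapply(names(forecast), function(p) {
  data.frame(ds = forecast[[p]]$ds,
             Product = p,
             yhat = forecast[[p]]$yhat,
             stringsAsFactors = FALSE)
}))

# Write one tidy table instead of the whole nested list.
out_file <- file.path(tempdir(), "Output_Prophet.csv")  # hypothetical path
write.csv(long, out_file, row.names = FALSE)
```

If dplyr is already loaded, `bind_rows(forecast, .id = "Product")` followed by selecting `ds`, `Product` and `yhat` should give the same shape.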

Related

Prophet Forecasting using R for multiple items

I am very new to time series forecasting with Prophet in R. I can predict values for a single product. Is there a way to use a loop to generate forecasts for multiple products? The code below works fine for a single product, but I am trying to generate forecasts for multiple products.
library(prophet)
df <- read.csv("Prophet.csv")
df$Date<-as.Date(as.character(df$Date), format = "%d-%m-%Y")
colnames(df) <- c("ds", "y")
m <- prophet(df)
future <- make_future_dataframe(m, periods = 40)
tail(future)
forecast <- predict(m, future)
write.csv(forecast[c('ds','yhat')],"Output_Prophet.csv")
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
Sample Dataset:
This can be done by using lists and map functions from the purrr package.
Let's build some data:
library(tidyverse) # also loads the purrr package
set.seed(123)
tb1 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
y = sample(365)
)
tb2 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
y = sample(365)
)
ts_list <- list(tb1, tb2) # two separate time series
# using this construct you could add more of course
Build and prediction:
library(prophet)
m_list <- map(ts_list, prophet) # prophet call
future_list <- map(m_list, make_future_dataframe, periods = 40) # makes future obs
forecast_list <- map2(m_list, future_list, predict) # map2 because we have two inputs
# we can access everything we need like with any list object
head(forecast_list[[1]]$yhat) # forecasts for time series 1
[1] 179.5214 198.2375 182.7478 173.5096 163.1173 214.7773
head(forecast_list[[2]]$yhat) # forecast for time series 2
[1] 172.5096 155.8796 184.4423 133.0349 169.7688 135.2990
Update (only the input part changes; the build and prediction part stays the same):
I created a new example based on the OP's request. Basically, you need to put everything in a list object again:
# suppose you have a data frame like this:
set.seed(123)
tb1 <- tibble(
ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
productA = sample(365),
productB = sample(365)
)
head(tb1)
# A tibble: 6 x 3
ds productA productB
<date> <int> <int>
1 2018-01-01 105 287
2 2018-01-02 287 71
3 2018-01-03 149 7
4 2018-01-04 320 148
5 2018-01-05 340 175
6 2018-01-06 17 152
# with some tidyr and base R you can transform each time series into a data frame within a list
ts_list <- tb1 %>%
gather("type", "y", -ds) %>%
split(.$type)
# this just removes the type column that we don't need anymore
ts_list <- lapply(ts_list, function(x) { x["type"] <- NULL; x })
# now you can continue just like above..
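As a hedged alternative to the gather/split step above, the same named list can be built with base R alone. This is only a sketch: `wide` is a small hypothetical data frame shaped like `tb1`, and each list element ends up with the `ds` and `y` columns that `prophet` expects.

```r
# Hypothetical wide input: one date column plus one column per product.
wide <- data.frame(
  ds = seq(as.Date("2018-01-01"), by = "day", length.out = 5),
  productA = c(105, 287, 149, 320, 340),
  productB = c(287, 71, 7, 148, 175)
)

# Every column except ds is treated as a separate time series.
products <- setdiff(names(wide), "ds")

# Build one two-column data frame (ds, y) per product.
ts_list <- lapply(products, function(p) {
  data.frame(ds = wide$ds, y = wide[[p]])
})
names(ts_list) <- products  # the names carry through map() to the forecasts
```

Keeping the list named is what later lets the product name be recovered when the forecasts are combined.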

Using lapply to output values between date ranges within different factor levels

I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:
Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")
And for the dates of each audit across different stores:
Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")
Note that the number of dates in each output will be uneven (i.e. there may not be a full week's worth of information before some store audits). I previously asked a question about a similar problem, "Creating a dataframe from an lapply function with different numbers of rows". Below is an answer from that question, which works if I consider information from only one store:
library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))
##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))
But I don't know how I would get this for multiple sites.
Try this:
# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")
library(tidyr)
library(dplyr)
# Gather df1 so that df1 and df2 have the same format:
df1 = gather(df1, Store, Sales, -t)
head(df1)
t Store Sales
1 2015-12-30 Store.A 16
2 2015-12-31 Store.A 24
3 2016-01-01 Store.A 8
4 2016-01-02 Store.A 42
5 2016-01-03 Store.A 7
6 2016-01-04 Store.A 46
# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:
r <- lapply(1:nrow(df2), function(i) {
audit.t = df2[i, "t"] #time of audit
audit.s = df1[, "Store"] == df2[i, "Store"] #store audited
df = df1[audit.s, ] #data from audited store
df[, "audited"] = audit.t #add extra column with audit date
week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
week_audit = difftime(df[, "t"], audit.t) <= 0
df[week_before & week_audit, ]
})
Does this give you the proper subsets?
Also, to summarise your results:
r = do.call("rbind", r) %>%
group_by(audited, Store) %>%
summarise(sales = sum(Sales))
r
audited Store sales
<time> <chr> <int>
1 2016-01-04 Store.A 97
2 2016-02-01 Store.B 156
3 2016-02-01 Store.C 226
4 2016-03-01 Store.A 115
5 2016-03-01 Store.C 187
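The same windowing idea can also be sketched without an explicit loop over audits: join every sale to the audits of its store, then keep the rows that fall in the week up to and including the audit date. The data below is a small hypothetical example (`sales`, `audits` and their values are assumptions, not the original data), and both date columns are assumed to be of class Date.

```r
# Hypothetical long-format sales: 10 days for each of two stores.
sales <- data.frame(
  t = rep(seq(as.Date("2016-01-01"), by = "day", length.out = 10), 2),
  Store = rep(c("Store.A", "Store.B"), each = 10),
  Sales = 1:20
)

# Hypothetical audits, one per store.
audits <- data.frame(
  Store = c("Store.A", "Store.B"),
  audited = as.Date(c("2016-01-05", "2016-01-09"))
)

# Pair each sale with every audit of the same store.
joined <- merge(sales, audits, by = "Store")

# Keep sales from the 7 days before the audit up to the audit day itself.
in_window <- joined$t >= joined$audited - 7 & joined$t <= joined$audited
window_sales <- joined[in_window, ]

# Total sales per store and audit.
totals <- aggregate(Sales ~ Store + audited, data = window_sales, FUN = sum)
```

Store.A has only five days of data before its audit here, which shows that incomplete weeks are handled without padding.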

How to Vectorize splitting a Date in R into Multiple Columns [duplicate]

I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))
dss_individuals[i,]["dayOfBirth"] <- split_list[3]
dss_individuals[i,]["monthOfBirth"] <- split_list[2]
dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?
I compared the speed of substr, format, and lubridate. It seems that lubridate and format are much faster than substr if the variable is stored as a Date. However, substr is fastest if the variable is stored as a character vector. The results of a single run are shown.
x <- sample(
seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by="day"),
400000, replace = T)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%Y") # %Y gives the four-digit year, matching substr(x, 1, 4)
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
library(lubridate) # provides year(), month() and day()
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082
I am not sure whether this will solve your speed issues, but here is a nicer way of doing it using dplyr and lubridate. In general, for manipulating data frames I recommend either data.table or dplyr. data.table is supposed to be faster, but dplyr is more verbose, which I personally prefer because I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame( mother_id = c(1,2,3),
dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)
dat %>% mutate( year = year(dateOfBirth) ,
month = month(dateOfBirth),
day = day(dateOfBirth) )
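Or you can use the mutate_each function to save having to write the variable name multiple times (though you get less control over the names of the output variables). Note that mutate_each and funs are deprecated in recent dplyr versions, where across is the replacement.
dat %>% mutate_each( funs(year , month , day) , dateOfBirth)
dat %>% mutate(across(dateOfBirth, list(year = year, month = month, day = day))) # dplyr >= 1.0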
Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)
Use format once for each part (this assumes dateOfBirth is stored as a Date):
dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This approach assumes that day, month and year are always in the same position and have the same length.
The strsplit function is vectorized, so you can split the whole column at once and use rbind.data.frame to convert the resulting list to a data frame:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
The result may need to be transposed before it can be used; you can do that with do.call or the t function.
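To make the strsplit route concrete, here is a small sketch that assumes the dates are character strings in YYYY-MM-DD form. Binding the split pieces with plain rbind gives a character matrix with one row per date, which can then be named and attached to the data; `dob` and `result` are illustrative names.

```r
# Dates as character strings, as in the question's sample data.
dob <- c("1962-09-24", "1991-02-19", "1978-11-11")

# strsplit handles the whole vector at once; each element becomes c(year, month, day).
parts <- do.call(rbind, strsplit(dob, "-", fixed = TRUE))
colnames(parts) <- c("yearOfBirth", "monthOfBirth", "dayOfBirth")

# Attach the three character columns to the original values.
result <- data.frame(dateOfBirth = dob, parts, stringsAsFactors = FALSE)
```

The new columns stay character, so leading zeros such as "09" are kept; wrap them in as.integer() if numeric values are needed.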

Fiscal-year return and standard deviation from daily returns

I want to calculate fiscal year returns and standard deviations from daily returns for a large number of firms. I am relatively new to R, having previously used SAS to calculate returns etc. However, I'd like to switch to R in the short/medium-term.
I have two files: 1) one containing a firm identifier, dates and daily returns (df.1), and 2) my sample (df.2), over which I'd like to aggregate the returns.
firm date ret
1 01/01/1992 0.024
1 02/01/1992 0.010
. . .
. . .
1 31/12/2014 0.002
2 01/01/1992 0.004
2 02/01/1992 0.012
The file is very large, about 1M rows.
The second file looks like this:
firm fiscal_year_start fiscal_year_end
1 01/01/1992 31/12/1992
1 01/01/1993 31/12/1993
1 01/01/1994 31/12/1994
I want to calculate fiscal year returns and annualised standard deviation. Both .csv files are loaded into R as data frames. I am unsure on how to best treat the date variables and how to structure the for loop to loop through the daily return file.
Any help would be much appreciated.
EDIT1
I am able to subset the big data frame using this function:
myfunc <- function(x,y,z){df.1[df.1$date1 >= x & df.1$date1 < y & df.1$firm == z,]}
firm1 <- df.2$firm[1]
start_date <- df.2$StartDate[1]
end_date <- df.2$EndDate[1]
Test <- myfunc(start_date,end_date, firm1)
For this subset I can then get the fiscal-year return and std:
# return
fiscal_year_ret <- with(Test, sum(Test$ret))
# annualized variance
var <- with(Test, var(Test$ret))
annualized_var <- var*nrow(Test) # nrow() counts observations; length() on a data frame counts columns
annualized_st.dev <- sqrt(annualized_var)
My big problem is embedding this into a loop that allows me to loop through the different firm identifiers and dates in df.2
EDIT2
So I have something like this
df.output <- data.frame(returns=as.numeric(),
std.deviation=as.numeric(),
stringsAsFactors=FALSE)
I would like to populate the above data frame with the results.
for (i in sample) {
myfunc <- function(x,y,z){df.1[df.1$date1 >= x & df.1$date1 < y & df.1$firm == firm1,]}
firm1 <- df.2$firm[i]
start_date <- df.2$StartDate[i]
end_date <- df.2$EndDate[i]
subset <- myfunc(start_date,end_date, firm1)
# return
fiscal_year_ret <- with(subset, sum(subset$ret))
df.output$returns <-fiscal_year_ret
# variance
var <- with(subset, var(subset$ret))
annualized_var <- var*length(subset)
annualized_st.dev <- sqrt(annualized_var)
}
Something like that.
Here is one way:
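library(dplyr)
library(lubridate)
df.1 %>% # the daily-return data frame from the question
mutate(year =
date %>%
dmy %>% # the dates in the question are day/month/year
floor_date(unit = "year")) %>%
group_by(firm, year) %>%
summarize(
mean_return = mean(ret),
sd_return = sd(ret))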
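When the fiscal years do not line up with calendar years, the windows in df.2 can be applied directly. The following is a base R sketch under several assumptions: the tiny `df.1` and `df.2` are made-up samples in the question's layout, the end date is treated as inclusive, and the return and annualised standard deviation follow the question's own definitions (sum of daily returns, and variance times the number of observations).

```r
# Hypothetical samples in the layout described in the question.
df.1 <- data.frame(
  firm = c(1, 1, 1, 1, 2, 2),
  date = c("01/01/1992", "02/01/1992", "01/01/1993", "04/01/1993",
           "01/01/1992", "02/01/1992"),
  ret = c(0.024, 0.010, 0.002, 0.006, 0.004, 0.012)
)
df.2 <- data.frame(
  firm = c(1, 1, 2),
  fiscal_year_start = c("01/01/1992", "01/01/1993", "01/01/1992"),
  fiscal_year_end = c("31/12/1992", "31/12/1993", "31/12/1992")
)

# Parse the day/month/year strings once, up front.
df.1$date <- as.Date(df.1$date, format = "%d/%m/%Y")
df.2$fiscal_year_start <- as.Date(df.2$fiscal_year_start, format = "%d/%m/%Y")
df.2$fiscal_year_end <- as.Date(df.2$fiscal_year_end, format = "%d/%m/%Y")

# One output row per firm-year in df.2.
result <- do.call(rbind, lapply(seq_len(nrow(df.2)), function(i) {
  in_year <- df.1$firm == df.2$firm[i] &
    df.1$date >= df.2$fiscal_year_start[i] &
    df.1$date <= df.2$fiscal_year_end[i]
  r <- df.1$ret[in_year]
  data.frame(
    firm = df.2$firm[i],
    fiscal_year_start = df.2$fiscal_year_start[i],
    fiscal_ret = sum(r),                       # sum of daily returns
    annualized_sd = sqrt(var(r) * length(r))   # variance scaled by observation count
  )
}))
```

Building one small data frame per firm-year and binding them at the end avoids growing `df.output` inside the loop.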

Find sum of daily variables in range of month dates in different data frame

We have external data with daily values (pulled in that format into a DB) that need to be added up to approximate monthly values, so that they align with another external dataset of approximate monthly values (that set has a date range, as YYYY-MM-DD, in separate start and end columns). The new data forms the basis for a linear regression.
We would like to use R to:
Add up the daily values in data1 that fall within each date range in data2. In other words, if "start" in Data2 = 2015-02-14 and "end" in Data2 = 2015-03-15, we want to know the total of the daily values of a variable in columnX of data1 for the range 2015-02-14 to 2015-03-15.
We can't figure out how to write a logical condition that automatically selects rows in Data1 based on the information in Data2. Is this expression close?
monthly=sum(data1$variable, if(data1$Date > Data2$StartDate & data1$Date < Data2$endDate))
data1$variable is in rows with data1$Date.
The expression above gives the error: Error: unexpected ')'
We cannot figure out how to build this argument. Any assistance would be appreciated!
Assume you have data1 and data2 as below
daysInData1 <- seq(as.Date('2013-03-1'), as.Date('2014-12-07'), by = 'day')
data1 <- data.frame(Date = daysInData1, variable = runif(length(daysInData1)))
daysInData2 <- seq(as.Date('2013-03-15'), as.Date('2015-03-14'), by = 'month')
data2 <- data.frame(StartDate = daysInData2, volume = seq(length(daysInData2)))
They should look like
>data1
Date variable
1 2013-03-01 0.944390132092
2 2013-03-02 0.168255153345
3 2013-03-03 0.919271149905
4 2013-03-04 0.456344844541
5 2013-03-05 0.365338093136
6 2013-03-06 0.158996492159
(...omit the rest)
>data2
StartDate volume
1 2013-03-15 1
2 2013-04-15 2
3 2013-05-15 3
(...omit the rest)
You can use a for loop to categorize data1 based on the date ranges in data2
for( i in 1:nrow(data2))
{
data1[data1$Date >= data2[(i), 'StartDate'] & data1$Date < data2[nrow(data2), 'StartDate'], 'DateMonthlySeg'] <- data2[i, 'StartDate']
}
Then aggregate to get the sum within each category in data1, and merge with data2 (for regression convenience)
data2 <- merge(data2, aggregate(variable ~ DateMonthlySeg, data = data1, sum),
by.x = 'StartDate', by.y = 'DateMonthlySeg')
Finally, perform linear regression
>lm(volume~variable, data = data2)
Call:
lm(formula = volume ~ variable, data = data2)
Coefficients:
(Intercept) variable
10.33248635 0.04394532
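If data2 carries both a start and an end date for each period, as described in the question, the totals can also be computed per row directly. This is a sketch under assumptions: the column names `StartDate` and `EndDate`, the sample ranges, and the choice to treat both endpoints as inclusive are all illustrative.

```r
# Hypothetical daily data; variable is constant so each total equals a day count.
data1 <- data.frame(
  Date = seq(as.Date("2015-02-01"), as.Date("2015-04-30"), by = "day")
)
data1$variable <- 1

# Hypothetical monthly ranges with explicit start and end dates.
data2 <- data.frame(
  StartDate = as.Date(c("2015-02-14", "2015-03-16")),
  EndDate = as.Date(c("2015-03-15", "2015-04-15"))
)

# For each range, sum the daily values whose date falls inside it (inclusive).
data2$monthly <- vapply(seq_len(nrow(data2)), function(i) {
  in_range <- data1$Date >= data2$StartDate[i] & data1$Date <= data2$EndDate[i]
  sum(data1$variable[in_range])
}, numeric(1))
```

The logical vector `in_range` plays the role of the `if(...)` in the attempted expression: subsetting with it selects the rows to sum.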
