Preprocessing tsibble to run time series models from fable package - r

I am trying to run some models on monthly time series data. The time series are not of equal length, and they do not all start or end in the same month. What I have is a numeric month column and a numeric year column. I have created a time variable from those two columns and made a tsibble out of it so that I can use the fable package. Below is how I am processing the data.
I am posting simulated data here.
# Packages
library(tidyverse)
library(tsibble)
library(fable)
library(fabletools)
# Simulated data
id <- c(rep (222, 28), rep(111, 36), rep(555, 16))
year <- c(rep(2014, 12), rep(2015, 12), rep(2016, 4),
rep(2014, 12), rep(2015, 12), rep(2016, 12),
rep(2015, 12), rep(2016, 4))
mnt <- c(seq(1, 12, by = 1), seq(1, 12, by = 1), seq(1, 4, by = 1),
seq(1, 12, by = 1), seq(1, 12, by = 1), seq(1, 12, by = 1),
seq(1, 12, by = 1), seq(1, 4, by = 1))
value <- rnorm(80, mean = 123, sd = 50)
dataf <- data.frame(id, mnt, year, value)
To make it a tsibble I am converting my month variable mnt into a character,
dataf$mnt[dataf$mnt == 1] <- "Jan"
dataf$mnt[dataf$mnt == 2] <- "Feb"
dataf$mnt[dataf$mnt == 3] <- "Mar"
dataf$mnt[dataf$mnt == 4] <- "Apr"
dataf$mnt[dataf$mnt == 5] <- "May"
dataf$mnt[dataf$mnt == 6] <- "Jun"
dataf$mnt[dataf$mnt == 7] <- "Jul"
dataf$mnt[dataf$mnt == 8] <- "Aug"
dataf$mnt[dataf$mnt == 9] <- "Sep"
dataf$mnt[dataf$mnt == 10] <- "Oct"
dataf$mnt[dataf$mnt == 11] <- "Nov"
dataf$mnt[dataf$mnt == 12] <- "Dec"
Adding month and year together
dataf %>% unite("time", mnt:year, sep = " ")
Make a tsibble
tsbl <- as_tsibble(dataf, index = time, key = id)
At this point, I get this error:
> tsbl <- as_tsibble(dataf, index = time, key = id)
Error: `var` must evaluate to a single number or a column name, not a function
Call `rlang::last_error()` to see a backtrace.
The remaining code is this:
# Fitting arima
fit <- tsbl %>%
fill_gaps(b = 0) %>%
model(
arima = ARIMA(value),
)
fit
# One month ahead forecast
fc <- fit %>%
forecast(h = 1)
fc
# Accuracy measure
accuracy_table <- accuracy(fit)
Any idea how to preprocess my data to run forecasting models from fable package?

There are two small issues in how you create the time column. First, you never assign the result of unite() back to dataf; it is only printed to the console. As a result, dataf has no time column when as_tsibble() runs, and time resolves to the base R function time() instead, which is what the error message is complaining about. Assigning the result back resolves the error you posted.
Second, the index needs a compatible data type. A character column isn't enough; you'll want something like the tsibble function yearmonth() to get the job done. For that, you'll see I flipped the order of the columns in your unite() call so the year comes first.
The relevant piece:
dataf <- dataf %>% unite("time", c(year, mnt), sep = " ") %>%
mutate(time = yearmonth(time))
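As an alternative sketch that skips the character conversion altogether, the numeric year and month columns can be turned into an index directly with tsibble's make_yearmonth(). The small data frame below is a hypothetical stand-in for dataf, not the data from the question.

```r
library(tsibble)

# Hypothetical stand-in for `dataf`: two series of unequal length
dataf <- data.frame(
  id    = c(222, 222, 222, 222, 111, 111, 111),
  year  = c(2014, 2014, 2014, 2014, 2015, 2015, 2016),
  mnt   = c(1, 2, 3, 4, 11, 12, 1),
  value = c(120, 95, 143, 110, 131, 88, 152)
)

# Build a yearmonth index straight from the numeric columns
dataf$time <- make_yearmonth(
  year  = as.integer(dataf$year),
  month = as.integer(dataf$mnt)
)

# `time` is now a real column, so as_tsibble() can use it as the index
tsbl <- as_tsibble(dataf[, c("id", "time", "value")], index = time, key = id)
tsbl
```

With this route the twelve month-name assignments and the unite() call are not needed at all.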

Related

Time Series Forecast using Arima in R

I am trying to forecast student behaviour by year. It isn't working, maybe because my data set is too small. I'm using ARIMA, but the forecast keeps showing a straight line, which I'm not sure is right. This might be because the selected model is ARIMA(0,0,0) with non-zero mean.
Year - General
Students - Numeric
How can I forecast a student's behaviour by year?
Small data set, but this is one way to go about it. Modelling each student separately (with student as a key), and using the tidyverts approach:
library(dplyr)
library(tidyr)
library(tsibble)
library(feasts)
library(fable)
Data set
df <- structure(list(Year = structure(c(1995, 1996, 1997), class = "numeric"),
Student1 = c(3, 1, 3), Student2 = c(2, 2, 2), Student3 = c(2,
3, 3), Student4 = c(2, 3, 2), Student5 = c(3, 3, 4)), row.names = c(NA,
3L), class = "data.frame")
Tidy data
df <- df |> pivot_longer(names_to = "Student", cols = starts_with("Student")) |>
as_tsibble(index = Year, key = Student)
Visualise
df |> autoplot()
df |>
filter(Student == "Student1") |>
gg_tsdisplay(value, plot_type = 'partial')
Fit ARIMA
Stu_fit <- df |>
model(search = ARIMA(value, stepwise = FALSE))
Check fit
glance(Stu_fit)
Stu_fit$search
Stu_fit |>
filter(Student == "Student1") |>
gg_tsresiduals()
Forecast
Stu_fit |>
forecast(h = 5) |>
filter(.model == 'search') |>
autoplot()
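On the straight-line forecast mentioned in the question: an ARIMA(0,0,0) model with a non-zero mean treats the series as white noise around a constant, so every forecast is just the estimated mean. A minimal sketch of that behaviour, using base R's stats::arima() rather than fable's ARIMA(), with the Student1 values from the data set above:

```r
# Student1 values from the example data
y <- c(3, 1, 3)

# ARIMA(0,0,0) with a mean term: white noise around a constant level
fit <- arima(y, order = c(0, 0, 0), include.mean = TRUE)

# Five-step-ahead point forecasts
pred <- as.numeric(predict(fit, n.ahead = 5)$pred)

pred      # all five values are the estimated mean
mean(y)   # sample mean, for comparison
```

So a flat forecast line is the expected output of that model, not a plotting problem.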
Hope this helps! :-)

Remove elements when getting the "missing element" error

I have a data set df that I subset into two lists, int1 and int2. Each element of these lists represents a 10-day period for a single individual (e.g., the first three elements of int1 represent three different 10-day periods for ID "A"). int2 is a bit different because it has only one 10-day period each for IDs "A" and "B". This is because the data for month 4 cover only one 10-day period.
I also have a list l1 that contains two matrices.
Setting up the example data:
library(lubridate)
library(tidyverse)
library(purrr)
date <- rep_len(seq(dmy("01-01-2011"), dmy("10-04-2011"), by = "days"), 100)
ID <- rep(c("A","B"), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
df$jDate <- yday(df$date)
df$month <- month(df$date)
int1 <- df %>%
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(month == "1") %>%
group_split()
int2 <- df %>%
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(ID, new) %>%
filter(month == "4") %>%
group_split()
m1 <- matrix(1:9, nrow = 3, ncol = 3)
m2 <- matrix(20:28, nrow = 3, ncol = 3)
l1 <- list(m1, m2)
The following code uses the objects created above. g1 is a grouping index that is used to split the lists when building lstMat.
f1 is a function that computes differences between the elements of int1 and int2; it is also used when building lstMat.
g1 <- rep(seq_along(l1), sapply(l1, nrow))
# Function to calculate the number of julian dates between the first date in the
# first interval and the first date of the second interval
f1 <- function(.int1, .int2) {
t(outer(seq_along(.int1), seq_along(.int2),
FUN = Vectorize(function(i, j) {
min(.int1[[i]]$jDate) -
min(.int2[[j]]$jDate)
})
))
}
This is the section of my script that I have been getting an error on.
lstMat <- purrr::map2(split(int1[seq_len(length(g1))], g1),
split(int2[seq_len(length(g1))], g1), f1)
Here is the error:
Error in `stop_subscript()`:
! Can't subset elements that don't exist.
x Locations 3, 4, 5, and 6 don't exist.
i There are only 2 elements.
Run `rlang::last_error()` to see where the error occurred.
I think the error occurs because of the length mismatch between g1 and int2 when creating the lstMat object. How can I modify the code so that the elements of g1 that have no counterpart in int2 are dropped when int2 is split by g1?

Getting the last 6 months data in R

I have a data frame that contains many rows. Here it is:
library(lubridate)
date_ <- date(seq(ymd_h("2020-01-01-00"), ymd_h("2021-03-31-23"), by = "hours"))
hour_ <- hour(seq(ymd_h("2020-01-01-00"), ymd_h("2021-03-31-23"), by = "hours"))
game1 <- sort(round(runif(length(date_), 10, 50), 0), decreasing = TRUE)
game2 <- sort(round(runif(length(date_), 20, 100), 0), decreasing = TRUE)
game3 <- sort(round(runif(length(date_), 30, 150), 0), decreasing = TRUE)
game4 <- sort(round(runif(length(date_), 40, 200), 0), decreasing = TRUE)
game_data <- data.frame(date_, hour_, game1, game2, game3, game4)
I just want to subset game_data to get the last 6 months of data. How do I do that?
Last 6 months of data, counting back from the max date in the data?
You can try -
library(lubridate)
library(dplyr)
result <- subset(game_data, date_ > max(date_) %m-% months(6))
Or with dplyr -
result <- game_data %>% filter(date_ > max(date_) %m-% months(6))
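The %m-% operator matters here because the last date in game_data is 2021-03-31, and there is no 31 September. A short sketch of the difference between %m-% and plain period subtraction (the variable names are only for illustration):

```r
library(lubridate)

end_date <- as.Date("2021-03-31")

# %m-% rolls back to the last valid day of the target month
cutoff_rollback <- end_date %m-% months(6)

# Plain subtraction lands on a date that does not exist and yields NA
cutoff_plain <- end_date - months(6)

cutoff_rollback
cutoff_plain
```

With the plain version the comparison `date_ > cutoff` would be NA for every row, so the filter would not return the rows you want.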

Combine outputs of a function for each index in a for loop in R

I have created a function which completes a calculation based on data from two data frames for an individual.
I want to complete that function for each individual and combine all the outputs in a data frame and export to .csv
Currently the output .csv only has data for person 34, none of the others.
I've noticed that when I run this it creates an object ID, which is just the numeric 34. It seems to be holding on only to the last ID in data$ID.
How can I create an output with results for all persons?
library(dplyr)
library(lubridate)
library(date)
screen_function = function(x){
# Select each person and get necessary inputs
data = data%>%filter(ID == x)
demogs = demogs %>% filter (P_ID == x)
data$Age = demogs$Age
data$result = data$test * data$Age
data$Date = as.Date(data$Date,format='%d/%m/%Y') # ensures date column is in correct format
# only include tests from most recent 12-24 months and only proceed if test in last 12 months
Recent_12m = data %>% filter(between(Date, today() - years(1), today()))
Recent_24m = data %>% filter(between(Date, today() - years(2), today()))
if ((nrow(Recent_12m)) > 0) {
data = rbind(Recent_12m,Recent_24m)
Recent_12m$min_date = min(Recent_12m$Date)
Recent_12m$Date = as.Date(Recent_12m$Date,format='%d/%m/%Y')
Recent_12m$min_date = as.Date(Recent_12m$min_date,format='%d/%m/%Y')
Recent_24m$min_date = min(Recent_24m$Date)
Recent_24m$Date = as.Date(Recent_24m$Date,format='%d/%m/%Y')
Recent_24m$min_date = as.Date(Recent_24m$min_date,format='%d/%m/%Y')
Recent_12m$Period = interval(Recent_12m$min_date, Recent_12m$Date)
Recent_12m$Years = as.numeric(Recent_12m$Period, unit = "years")
Recent_24m$Period = interval(Recent_24m$min_date, Recent_24m$Date)
Recent_24m$Years = as.numeric(Recent_24m$Period, unit = "years")
# Latest result
Last = filter(Recent_12m, (Recent_12m$Date == max(Date)))
# linear regression model
Reg_12month <- lm(result ~ Years, data=Recent_12m)
Reg_24month <- lm(result ~ Years, data=Recent_24m)
info = c(x, round(Last$result, digits=1), round(Reg_12month$coefficients["Years"], digits = 1), round(Reg_24month$coefficients["Years"], digits = 1))
newdf = data.frame(matrix(0, ncol = 4))
colnames(newdf) = c("ID", "Latest result", "Trend 12month", "Trend 24 month")
newdf= rbind(newdf, info)
write.csv(newdf, "filepath.csv")
}
}
Date= sample(seq(as.Date('2019/11/01'), as.Date('2020/11/01'), by="day"), 12)
ID= c(12,12,12,450,450,450,1,1,1,34,34,34)
test= rnorm(12, mean=150, sd=60)
data= data.frame(ID, Date, test)
P_ID = c(1,12,34,450)
Age = c(50,45,60,72)
demogs = data.frame(P_ID, Age)
persons = unique(data$ID)
for(ID in persons){
screen_function(paste("", ID,"", sep=""))
}
Created on 2020-11-16 by the reprex package (v0.3.0)
So, I've got around this by using a pre-made .csv, instead of creating a new dataframe. The .csv just contains a single row of 4 columns, with random entries in each cell.
# Read the rows written so far, append this person's results, and write back
newdf = read.csv(file = "filepath.csv")
info = c(x, round(Last$result, digits=1), round(Reg_12month$coefficients["Years"], digits = 1),
round(Reg_24month$coefficients["Years"], digits = 1))
newdf = rbind(newdf, info)
colnames(newdf) = c("ID", "Latest result", "Trend 12month", "Trend 24 month")
# Keep one row per ID
newdf = distinct(newdf, ID, .keep_all = TRUE)
write.csv(newdf, "filepath.csv", row.names = FALSE)}}
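A different way to get one row per person, sketched under the assumption that the function can return its result instead of writing the file itself: build a one-row data frame per ID, bind the rows together, and write the CSV once at the end. summarise_person() and the toy data below are hypothetical stand-ins, not the screening calculation from the question.

```r
# Hypothetical stand-in data, shaped like `data` in the question
data <- data.frame(
  ID   = c(12, 12, 450, 450, 1, 1, 34, 34),
  test = c(150, 160, 140, 155, 170, 165, 120, 130)
)

# Hypothetical helper: returns ONE row per person and writes nothing
summarise_person <- function(x, data) {
  person <- data[data$ID == x, ]
  data.frame(
    ID        = x,
    latest    = person$test[nrow(person)],  # last recorded test value
    mean_test = mean(person$test)
  )
}

persons <- unique(data$ID)

# One data frame per person, then stack them into a single table
rows <- lapply(persons, summarise_person, data = data)
summary_df <- do.call(rbind, rows)

# Write once, after the loop; a temporary file is used here for illustration
out_file <- tempfile(fileext = ".csv")
write.csv(summary_df, out_file, row.names = FALSE)
summary_df
```

Because the file is written only once, nothing is overwritten between iterations and no pre-made .csv is needed.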

Creating all possible variable combinations in R

I have a daily dataset of 4 parameters, which I have converted into monthly data using the following code
library(zoo)
library(hydroTSM)
library(lubridate)
library(tidyverse)
set.seed(123)
df <- data.frame("date"= seq(from = as.Date("1983-1-1"), to = as.Date("2018-12-31"), by = "day"),
"Parameter1" = runif(length(seq.Date(as.Date("1983-1-1"), as.Date("2018-12-31"), "days")), 15, 35),
"Parameter2" = runif(length(seq.Date(as.Date("1983-1-1"), as.Date("2018-12-31"), "days")), 11, 29),
"Parameter3" = runif(length(seq.Date(as.Date("1983-1-1"), as.Date("2018-12-31"), "days")), 50, 90),
"Parameter4" = runif(length(seq.Date(as.Date("1983-1-1"), as.Date("2018-12-31"), "days")), 0, 27))
Monthly_data <- daily2monthly(df, FUN=mean, na.rm=TRUE)
After that, I reshaped it so that each month becomes a column, using the following code
#Function to convert month abbreviation to a numeric month
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
Monthly_data %>%
dplyr::as_tibble(rownames = "date") %>%
separate("date", c("Month", "Year"), sep = "-", convert = T) %>%
mutate(Month = mo2Num(Month))%>%
tidyr::pivot_longer(cols = -c(Month, Year)) %>%
pivot_wider(names_from = Month, values_from = value, names_prefix = "Mon",
names_sep = "_") %>%
arrange(name)
Now I want to create parameter combinations such as Parameter1 * Parameter2, Parameter1 * Parameter3, Parameter1 * Parameter4, Parameter2 * Parameter3, Parameter2 * Parameter4 and Parameter3 * Parameter4, and add them to the pivoted monthly data with rbind. Here Parameter1 * Parameter2 means multiplying the two parameters' monthly values and then appending the result to the output above with rbind, and likewise for all the other combinations. How can I achieve this?
You can use this base R approach with combn, assuming data are present for all years for all parameters. Here df1 is the data frame produced by the pipeline above that ends with arrange(name).
data <- combn(unique(df1$name), 2, function(x) {
t1 <- subset(df1, name == x[1])
t2 <- subset(df1, name == x[2])
t3 <- t1[-(1:2)] * t2[-(1:2)]
t3$name <- paste0(x, collapse = "_")
cbind(t3, t1[1])
}, simplify = FALSE)
You can then rbind it to original data.
new_data <- rbind(df1, do.call(rbind, data))
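To see what this produces, here is the same idea on a small toy data frame. The column names and values are made up for illustration and stand in for the pivoted monthly data.

```r
# Toy stand-in for df1: three parameters, two years, two month columns
df1 <- data.frame(
  Year = rep(c(1983, 1984), times = 3),
  name = rep(c("P1", "P2", "P3"), each = 2),
  Mon1 = c(1, 2, 3, 4, 5, 6),
  Mon2 = c(10, 20, 30, 40, 50, 60)
)

month_cols <- c("Mon1", "Mon2")

# For every pair of parameters, multiply the month columns element-wise
pairs <- combn(unique(df1$name), 2, function(x) {
  t1 <- subset(df1, name == x[1])
  t2 <- subset(df1, name == x[2])
  out <- t1[month_cols] * t2[month_cols]
  out$name <- paste0(x, collapse = "_")
  cbind(Year = t1$Year, out)
}, simplify = FALSE)

# Append the products to the original rows; rbind matches columns by name
new_data <- rbind(df1, do.call(rbind, pairs))
new_data
```

Three parameters give three pairs (P1_P2, P1_P3, P2_P3); with the four parameters in the question, combn yields the six combinations listed there.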
