Dynamic time series using TSA::arimax and stats::arima - r

I am looking for more information to understand the difference between TSA::arimax and stats::arima when used for dynamic time series. I am interested in exploring the interplay between drinking and smoking rates in young people, treating smoking as the outcome variable.
Using the two commands (code below) produces the same results on my data - is this because I only have one IV, and/or because I am not specifying any p or q values for the transfer variable?
I have seen online that TSA::arimax fits a transfer function model rather than an ARIMAX model, but I am not sure how they differ.
alcohol.ts = ts(data = data$alcohol, frequency = 4,
                start = c(data[1, "Year"], data[1, "Quarter"]))
iv = data.frame(alcohol = alcohol.ts)   # single independent variable
dv = ts(data = data$smoke, frequency = 4,
        start = c(data[1, "Year"], data[1, "Quarter"]))
(model1 = stats::arima(dv, order = c(2, 1, 0),
                       seasonal = list(order = c(0, 0, 0), period = 4),
                       xreg = iv[, 1],
                       transform.pars = FALSE, optim.control = list(maxit = 1000),
                       method = 'ML'))
(model2 = TSA::arimax(dv, order = c(2, 1, 0),
                      seasonal = list(order = c(0, 0, 0), period = 4),
                      xtransf = iv[, 1], transfer = list(c(0, 0)),
                      transform.pars = FALSE, optim.control = list(maxit = 1000),
                      method = 'ML'))
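For what it's worth, here is a minimal sketch (not from the original post) of where the two calls would stop agreeing: transfer = list(c(0, 0)) reduces the transfer function to a single regression coefficient, which is why it matches xreg in stats::arima, whereas a non-zero denominator order, e.g. transfer = list(c(1, 0)), introduces a rational lag structure that stats::arima cannot express through xreg.
# Sketch only: a first-order (denominator) transfer function for the alcohol series;
# this is the point at which TSA::arimax and stats::arima should start to differ.
(model3 = TSA::arimax(dv, order = c(2, 1, 0),
                      seasonal = list(order = c(0, 0, 0), period = 4),
                      xtransf = iv[, 1], transfer = list(c(1, 0)),
                      method = 'ML'))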

Related

Derivative of an interaction term in a GAM model

I'm trying to estimate the price elasticity of demand for each customer using a GAM, with a model like this:
$$\ln D = \ln P + \ln P \cdot \sum_{i=1}^{20} f(X_i)$$
$$PED = \frac{\partial \ln D}{\partial \ln P} = 1 + \sum_{i=1}^{20} f(X_i)$$
where $D$ is demand, $P$ is the rate (price), $PED$ is the price elasticity of demand, and $X_i$ is a set of customer variables.
Since $PED$ is not observable, I want to estimate it from the GAM fitted to log demand, but I am having difficulty working out how to do so.
I tried extracting the individual splines to calculate PED, but failed. I know there is a package called gratia with a derivatives function, but I don't understand how to use it to calculate PED.
Once the demand model is fitted, I will need to estimate the price elasticity of demand for each customer, but for these customers I don't have the rate variable, only the 20 personal variables.
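For reference, a minimal sketch (using the illustrative variable names from the EDIT below, where log.R plays the role of $\ln P$ and log.Y of $\ln D$) of how the stated model could be written in mgcv: the "1 +" in the PED expression corresponds to a parametric log.R term, and each $f(X_i)$ to a varying-coefficient smooth s(X_i, by = log.R).
library(mgcv)
# Sketch only: parametric log.R term plus varying-coefficient smooths, so that
# d(log.Y)/d(log.R) = beta + f(A) + f(B) + f(C) + f(D), i.e. the PED expression above.
model <- gam(log.Y ~ log.R + s(A, by = log.R) + s(B, by = log.R) +
                     s(C, by = log.R) + s(D, by = log.R),
             data = mydata, method = "REML")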
I read some links:
https://stats.stackexchange.com/questions/495775/first-derivative-of-fitted-gam-changes-according-to-specified-model-distribution
https://stats.stackexchange.com/questions/590167/how-can-i-calculate-a-derivative-of-a-global-smooth-and-group-level-smooths-with
https://stats.stackexchange.com/questions/32013/what-is-the-mathematical-model-formula-corresponding-to-this-gam-model-fit-in-r
I would really appreciate any explanation, advice, or other ways to model my data.
Thanks
EDIT
What I've tried:
library(mgcv)
library(marginaleffects)

# create the dataset
A <- sample(x = 0:1000, size = 5000, replace = TRUE)
B <- sample(x = 0:1000, size = 5000, replace = TRUE)
C <- sample(x = 0:1000, size = 5000, replace = TRUE)
D <- sample(x = 0:1000, size = 5000, replace = TRUE)
log.R <- log(rbeta(5000, 5, 10) * 10)       # log rate
log.Y <- log(rgamma(5000, 10, 20) * 10000)  # log demand
mydata <- data.frame(A, B, C, D, log.R, log.Y)

# the model
model <- gam(log.Y ~ s(A, by = log.R) + s(B, by = log.R) +
                     s(C, by = log.R) + s(D, by = log.R),
             data = mydata, method = "REML")

# numerical slope of fitted log demand with respect to log.R
mfx <- marginaleffects(model, variables = "log.R", eps = 10^-5)
head(mfx)
mfx returns a 'dydx' column - is this the elasticity for the data used to fit the model?
And when I apply the model to new data, I get an error:
newdat = data.frame(A = 750, B = 500, C = 398, D = 740)
marginaleffects(model, variables = "log.R", eps = 10^-5, newdata= newdat, slope = 'dydx')
Error: There is no valid predictor variable. Please change the `variables` argument or supply a new data frame to the `newdata` argument.
What should I do?
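Not an authoritative fix, but one observation: the error seems to occur because newdat has no log.R column, so there is nothing to take a slope with respect to. Below is a minimal finite-difference sketch using only predict() from mgcv, evaluated over an assumed grid of candidate log rates (an assumption on my part, since the true rate for these customers is unknown):
# Sketch only: evaluate d(log.Y)/d(log.R) for one new customer over a grid of
# hypothetical log rates, via a small finite difference on the fitted GAM.
eps <- 1e-5
log.R.grid <- seq(min(mydata$log.R), max(mydata$log.R), length.out = 25)
newdat <- data.frame(A = 750, B = 500, C = 398, D = 740, log.R = log.R.grid)
newdat.eps <- transform(newdat, log.R = log.R + eps)
ped <- (predict(model, newdata = newdat.eps) - predict(model, newdata = newdat)) / eps
summary(ped)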

Learning a hidden Markov model in R

A hidden Markov model (HMM) is one in which you observe a sequence of observations, but do not know the sequence of states the model went through to generate the observations. Analyses of hidden Markov models seek to recover the sequence of hidden states from the observed data.
I have data with both observations and hidden states (the observations are continuous values) where the hidden states were tagged by an expert. I would like to train an HMM that would be able - based on a (previously unseen) sequence of observations - to recover the corresponding hidden states.
Is there any R package to do that? From studying the existing packages (depmixS4, HMM, seqHMM - the last for categorical data only), it seems they only let you specify the number of hidden states.
EDIT:
Example:
data.tagged.by.expert = data.frame(
hidden.state = c("Wake", "REM", "REM", "NonREM1", "NonREM2", "REM", "REM", "Wake"),
sensor1 = c(1,1.2,1.2,1.3,4,2,1.78,0.65),
sensor2 = c(7.2,5.3,5.1,1.2,2.3,7.5,7.8,2.1),
sensor3 = c(0.01,0.02,0.08,0.8,0.03,0.01,0.15,0.45)
)
data.newly.measured = data.frame(
sensor1 = c(2,3,4,5,2,1,2,4,5,8,4,6,1,2,5,3,2,1,4),
sensor2 = c(2.1,2.3,2.2,4.2,4.2,2.2,2.2,5.3,2.4,1.0,2.5,2.4,1.2,8.4,5.2,5.5,5.2,4.3,7.8),
sensor3 = c(0.23,0.25,0.23,0.54,0.36,0.85,0.01,0.52,0.09,0.12,0.85,0.45,0.26,0.08,0.01,0.55,0.67,0.82,0.35)
)
I would like to create an HMM with discrete time t where the random variable x(t) represents the hidden state at time t, x(t) ∈ {"Wake", "REM", "NonREM1", "NonREM2"}, and 3 continuous random variables sensor1(t), sensor2(t), sensor3(t) represent the observations at time t.
model.hmm = learn.model(data.tagged.by.user)
Then I would like to use the created model to estimate the hidden states responsible for newly measured observations:
hidden.states = estimate.hidden.states(model.hmm, data.newly.measured)
Data (training/testing)
To be able to run learning methods for the Naive Bayes classifier, we need a longer data set:
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
artificial.hypnogram = rep(c(5,4,1,2,3,4,5), times = c(40,150,200,300,50,90,30))
data.tagged.by.expert = data.frame(
hidden.state = states[artificial.hypnogram],
sensor1 = log(artificial.hypnogram) + runif(n = length(artificial.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*artificial.hypnogram + sample(c(-8:8), size = length(artificial.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(artificial.hypnogram), replace = T)
)
hidden.hypnogram = rep(c(5,4,1,2,4,5), times = c(10,10,15,10,10,3))
data.newly.measured = data.frame(
sensor1 = log(hidden.hypnogram) + runif(n = length(hidden.hypnogram), min = 0.2, max = 0.5),
sensor2 = 10*hidden.hypnogram + sample(c(-8:8), size = length(hidden.hypnogram), replace = T),
sensor3 = sample(1:100, size = length(hidden.hypnogram), replace = T)
)
Solution
In the solution, we use the Viterbi algorithm combined with a Naive Bayes classifier.
At each clock time t, a hidden Markov model consists of
an unobserved state (denoted hidden.state in this case) taking one of a finite number of states
states = c("NonREM1", "NonREM2", "NonREM3", "REM", "Wake")
a set of observed variables (sensor1, sensor2, sensor3 in this case)
Transition matrix
A new state is entered based upon a transition probability distribution
(transition matrix). This can easily be computed from data.tagged.by.expert, e.g. using
library(markovchain)
trans_p <- markovchainFit(data.tagged.by.expert$hidden.state)$estimate  # transition matrix
Emission matrix
After each transition is made, an observation (sensor_i) is produced according to a conditional probability distribution (emission matrix) which depends only on the current state H of hidden.state. We will replace the emission matrices with a Naive Bayes classifier.
library(caret)
library(klaR)
library(e1071)
model = train(hidden.state ~ .,
              data = data.tagged.by.expert,
              method = 'nb',
              trControl = trainControl(method = 'cv', number = 10))
Viterbi algorithm
To solve the problem, we use the Viterbi algorithm with an initial probability of 1 for the "Wake" state and 0 otherwise (we expect the patient to be awake at the beginning of the experiment).
# we expect the patient to be awake in the beginning
start_p = c(NonREM1 = 0,NonREM2 = 0,NonREM3 = 0, REM = 0, Wake = 1)
# Naive Bayes model
model_nb = model$finalModel
# the observations
observations = data.newly.measured
nObs <- nrow(observations) # number of observations
nStates <- length(states) # number of states
# T1, T2 initialization
T1 <- matrix(0, nrow = nStates, ncol = nObs) #define two 2-dimensional tables
row.names(T1) <- states
T2 <- T1
Byj <- predict(model_nb, newdata = observations[1,])$posterior
# init first column of T1
for(s in states)
T1[s,1] = start_p[s] * Byj[1,s]
# fill T1 and T2 tables
for(j in 2:nObs) {
Byj <- predict(model_nb, newdata = observations[j,])$posterior
for(s in states) {
res <- (T1[,j-1] * trans_p[,s]) * Byj[1,s]
T2[s,j] <- states[which.max(res)]
T1[s,j] <- max(res)
}
}
# backtrack the best path
result <- rep("", times = nObs)
result[nObs] <- names(which.max(T1[,nObs]))
for (j in nObs:2) {
result[j-1] <- T2[result[j], j]
}
# show the result
result
# show the original artificial data
states[hidden.hypnogram]
References
To read more about the problem, see: Vomlel, Jiří and Kratochvíl, Václav: Dynamic Bayesian Networks for the Classification of Sleep Stages, Proceedings of the 11th Workshop on Uncertainty Processing (WUPES'18), pp. 205-215, Třeboň, Czech Republic, 2018.

Predicting with a bsts model and updated olddata

I've built a bsts model using 2 years of weekly historical data. I'm able to predict using the model with the existing training data. In order to mimic the process that would occur with the model in production, I've created an xts object that moves the 2 years of data forward by one week. When I try to predict using this dataset (populating the olddata parameter in predict.bsts), I receive the following error:
Error in terms.default(object) : no terms component nor attribute
I realize I'm probably doing something dumb here, but I haven't been able to find any examples of how olddata is used when predicting. I appreciate any help you can provide.
Thanks
library(bsts)
library(xts)

dat = xts(fcastdat$SumScan_units, order.by = fcastdat$enddate)
traindat = window(dat, start = as.Date("2015-01-03"), end = as.Date("2016-12-26"))

ss = AddLocalLevel(list(), traindat)
ss = AddSeasonal(ss, traindat, nseasons = 52)
holidays = c("EasterSunday", "USMothersDay", "IndependenceDay", "MemorialDay",
             "LaborDay", "Thanksgiving", "Christmas")
ss = AddNamedHolidays(ss, named.holidays = holidays, traindat)

model_loclev_seas_hol = bsts(traindat, state.specification = ss, niter = 500,
                             ping = 50, seed = 1289)
burn = SuggestBurn(0.1, model_loclev_seas_hol)
pred_len = 5

# prediction from the training data works fine
pred = predict.bsts(model_loclev_seas_hol, horizon = pred_len, burn = burn,
                    quantiles = c(.025, .975))

# shift the window forward by one week and predict using olddata
begdt = index(traindat[1]) + 7
enddt = index(traindat[length(traindat)]) + 7
predseries = window(dat, start = as.Date(begdt), end = as.Date(enddt))
pred2 = predict.bsts(model_loclev_seas_hol, horizon = pred_len, burn = burn,
                     olddata = predseries, quantiles = c(.025, .975))
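Not a direct fix for the terms error, but for completeness, here is a minimal sketch (my own workaround, not from the original post) of one way to mimic the weekly roll-forward: refit the model on the shifted window using the same state specification steps and predict from the refitted object.
# Sketch only: refit on the shifted window instead of passing olddata.
ss2 = AddLocalLevel(list(), predseries)
ss2 = AddSeasonal(ss2, predseries, nseasons = 52)
ss2 = AddNamedHolidays(ss2, named.holidays = holidays, predseries)
model_refit = bsts(predseries, state.specification = ss2, niter = 500,
                   ping = 50, seed = 1289)
pred_refit = predict.bsts(model_refit, horizon = pred_len,
                          burn = SuggestBurn(0.1, model_refit),
                          quantiles = c(.025, .975))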

Random Forest using R

I'm working on building a predictive model for breast cancer data using R. After performing GCRMA normalization, I generated the potential predictor variables. Now when I run the RF algorithm, I encounter the following error:
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
code:
library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)
setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]
#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names
target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)
tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."
As I'm new to machine learning, I'm unable to proceed. Any guidance would be much appreciated.
Thanks in advance.
It is hard to say without knowing the data. Run class or summary on all your predictor variables to ensure that they are not accidentally being interpreted as characters or factors. If you really do have factors with more than 53 levels, you will have to convert them to binary variables. Example:
mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
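To make the first suggestion concrete, here is a small sketch (assuming the predictor_data matrix built in the question) for checking how each column is being interpreted and coercing anything that should be numeric:
# Sketch only: gene-expression predictors should all be numeric.
predictor_df <- as.data.frame(predictor_data)
table(sapply(predictor_df, class))
# coerce character columns that are really numbers (an assumption about this data)
char_cols <- sapply(predictor_df, is.character)
predictor_df[char_cols] <- lapply(predictor_df[char_cols], as.numeric)
rf_output <- randomForest(x = predictor_df, y = target, importance = TRUE,
                          ntree = 25001, proximity = TRUE, sampsize = sampsizes)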

R: Holt-Winters with daily data (forecast package)

In the following example, I am trying to use Holt-Winters smoothing on daily data, but I run into a couple of issues:
library(zoo)
library(forecast)
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# convert to a zoo object
zooData = as.zoo(mData[, 2, drop = FALSE],
order.by = as.Date(mData[, 1, drop = FALSE], format = '%Y-%m-%d'),
frequency = 7)
# attempt Holt-Winters smoothing
hw(x = zooData, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
# no missing values in the data
sum(is.na(zooData))
This leads to the following error:
Error in ets(x, "AAA", alpha = alpha, beta = beta, gamma = gamma,
damped = damped, : You've got to be joking. I need more data! In
addition: Warning message: In ets(x, "AAA", alpha = alpha, beta =
beta, gamma = gamma, damped = damped, : Missing values encountered.
Using longest contiguous portion of time series
Emphasis mine.
A couple of questions:
1. Where are the missing values coming from?
2. Am I right in assuming that the "need more data" error arises from attempting to estimate 365 seasonal parameters?
Update 1:
Based on Gabor's suggestion, I have recreated a fractional index for the data where whole numbers are weeks.
I have a couple of questions:
1. Is this an appropriate way of handling daily data when the periodicity is assumed to be weekly?
2. Is there a more elegant way of handling the dates when working with daily data?
library(zoo)
library(forecast)
# generate some dummy daily data
mData = cbind(seq.Date(from = as.Date('2011-12-01'),
to = as.Date('2013-11-30'), by = 'day'), rnorm(731))
# convert to a zoo object with weekly frequency
zooDataWeekly = as.zoo(mData[, 2, drop = FALSE],
order.by = seq(from = 0, by = 1/7, length.out = 731))
# attempt Holt-Winters smoothing
hwData = hw(x = zooDataWeekly, h = 10, seasonal = 'additive', damped = FALSE,
initial = 'optimal', exponential = FALSE, fan = FALSE)
plot(zooDataWeekly, col = 'red')
lines(fitted(hwData))
hw requires a ts object not a zoo object. Use
zooDataWeekly <- ts(mData[,2], frequency=7)
Unless there is a good reason for specifying the model exactly, it is usually better to let R select the best model for you:
fit <- ets(zooDataWeekly)
fc <- forecast(fit)
plot(fc)
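Tying this back to the original Holt-Winters call, a minimal sketch (assuming mData from the question) would be:
library(forecast)
# Sketch: Holt-Winters on the daily series, with weekly seasonality via a ts object.
tsDataWeekly <- ts(mData[, 2], frequency = 7)
hwFit <- hw(tsDataWeekly, h = 10, seasonal = 'additive', damped = FALSE)
plot(hwFit)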
