How can I extract confidence intervals from ezBoot (ez package)?

I am using the ezBoot function from the ez package. I would like to extract the confidence intervals that are plotted with the ezPlot2 function of the same package.
An example can be found in the ezBoot documentation:
library(ez)
# Read in the ANT data (see ?ANT).
data(ANT)
head(ANT)
ezPrecis(ANT)
# Run ezBoot on the accurate RT data
rt = ezBoot(
    data = ANT
    , dv = rt
    , wid = subnum
    , within = .(cue, flank)
    , between = group
    , iterations = 1e1 # 1e3 or higher is best for publication
)
# Plot the full design
p = ezPlot2(
    preds = rt
    , x = flank
    , split = cue
    , col = group
)
print(p)
How do I extract the confidence intervals?

Never mind, I hadn't read the complete set of arguments for the [ezPlot2 function](http://www.inside-r.org/packages/cran/ez/docs/ezPlot2). To be fair, the initial description says that the function is for displaying results; however, if the parameter do_plot is set to FALSE, it will return the point predictions (I am guessing these are the averages) and confidence intervals:
do_plot: Logical. If FALSE, no plot will be produced but instead a data frame
containing point predictions and confidence limits will be returned.
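For example, re-running the ezPlot2 call from above with do_plot = FALSE should return that data frame instead of printing a plot (a minimal sketch reusing the rt object created earlier; the name boot_cis is just illustrative):
# Same call as before, but asking for the data frame of point predictions
# and confidence limits rather than a plot
boot_cis = ezPlot2(
    preds = rt
    , x = flank
    , split = cue
    , col = group
    , do_plot = FALSE
)
head(boot_cis)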

Related

Is there a way to simulate time series data with a specific rolling mean and autocorrelation in R?

I have an existing time series (1000 samples) and calculated its rolling mean using the filter() function in R, averaging across 30 samples each time. The goal was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, and that would result in the same rolling mean if I applied the same filter() function to the artificial data. In short, I would like to simulate a time series with the same overall course but not the exact same values as an existing time series. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend are not the same.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){filter(x, rep(1 / n, n), sides = 2, circular = T)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problems are that a) the variance of the simulated values is higher than the variance of the original values, b) the rolling average, although quite similar to the original, sometimes deviates quite a bit from it, and c) there is no autocorrelation. Having an autocorrelation structure in the data would be good, since it is supposed to resemble the original data.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
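For concreteness, the adjusted noise step from the edit then reads (same objects as in the code above):
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sqrt(var(df$ts) - var(df$rolling_mean)))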
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
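One possible direction, sketched here only to illustrate the "autocorrelated noise around the trend" idea from the points above (it is not part of the original attempts): fit an AR(1) to the residuals around the rolling mean and add simulated noise with that structure back onto the rolling mean.
# Sketch: keep the original rolling mean as the trend and simulate only the
# fluctuations around it with an AR(1) fitted to the observed residuals
resid_ts <- df$ts - df$rolling_mean
res_fit <- arima(resid_ts, order = c(1, 0, 0))
sim_noise <- arima.sim(model = list(ar = res_fit$coef["ar1"]), n = 1000,
                       sd = sqrt(res_fit$sigma2))
df$sim3 <- df$rolling_mean + as.numeric(sim_noise)
lines(df$sim3, col = "purple")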

Value at risk estimation using the fGarch package in R

I am trying to carry out an analysis similar to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach', but I am stuck on a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR using a rolling window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate should be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe the estimated VaR is calculated as
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
but I might be wrong. I have tried to find related questions here but have not found any.
My approach can be summarized in three steps:
1. Simulate 2000 AR(1)-GARCH(1,1) observations, fit a corresponding model on a rolling window of 1000 observations, and extract the one-day-ahead prediction of the conditional mean and standard deviation (thereby making 1000 predictions).
2. Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
3. Check whether the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)

nObs <- 2000                      # Number of observations.
quantileLevel <- 0.95             # Since we expect 5% exceedances.
from <- seq(1, 1000)              # Lower index vector for observations in model.
to <- seq(1001, 2000)             # Upper index vector for observations in model.
VaR_vec <- rep(0, (nObs - 1000))  # Empty vector for storage of 1000 VaR estimates.

# Specs for simulated data (including AR(1) component and all components for GARCH(1,1)).
spec <- garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
                  cond.dist = 'norm')

# Simulate 2000 data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))

for (i in 1:1000){
    # The rolling window of 1000 observations.
    data_insert <- data_sim[from[i]:to[i]]
    # Fit an AR(1)-GARCH(1,1) model with normal cond.dist.
    fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
                             trace = FALSE,
                             cond.dist = "norm")
    # One-day-ahead forecast of conditional mean and standard deviation.
    prediction_model <- predict(fitted_model, n.ahead = 1)
    mu_pred <- prediction_model$meanForecast
    sigma_pred <- prediction_model$standardDeviation
    # Calculate the VaR forecast.
    VaR_vec[i] <- mu_pred + sigma_pred * qnorm(quantileLevel)
    if (length(to) - i != 0){
        print(c('Countdown, just', (length(to) - i), 'iterations left'))
    } else {
        print('Done!')
    }
}

# Extract only the observations related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0, length(VaR_vec))
# Count the number of exceedances.
for (i in 1:length(VaR_vec)){
    hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}

plot(data_sim[1001:2000], type = 'l',
     ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')

cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1 - quantileLevel) - cover_prop))
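As a small aside (not part of the original code), the exceedance count above can be written without the loop; this is just a vectorised equivalent of the hit loop:
# Vectorised version of the exceedance count above
hit <- as.numeric(VaR_vec <= compare_data_sim)
cover_prop <- mean(hit)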

How to plot per tree ROC curves from randomForest in R?

I know that randomForest is supposed to be a black box, and that most people are interested in the ROC curve of the classifier as a whole, but I'm working on a problem in which I need to inspect individual trees of RF. I'm not very experienced with R so what's an easy way to plot ROC curves for the individual trees generated by RF?
I don't think you can generate a ROC curve from a single tree from a random forest generated by the randomForest package. You can access the output of each tree from a prediction, for example over the training set.
# caret for an example data set
library(caret)
library(randomForest)
data(GermanCredit)
# use only 50 rows for demonstration
nrows = 50
# extract the first 9 columns and 50 rows as training data (column 10 is "Class", the target)
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# build the model
rf_model = randomForest(x = x, y = y, ntree = 11)
# Compute the prediction over the training data. Note predict.all = TRUE
rf_pred = predict(rf_model, newdata = x, predict.all = TRUE, type = "prob")
You can access the predictions of each tree with
rf_pred$individual
However, the prediction of a single tree is only the most likely label. For a ROC curve you need class probabilities, so that varying the decision threshold changes the predicted class and thereby traces out different true and false positive rates.
As far as I can tell, at least in package randomForest there is no way to make the leaves output probabilities instead of labels. If you inspect a tree with getTree(), you will see that the prediction is binary; use getTree(rf_model, k = 1, labelVar = TRUE) and you'll see the labels in plain text.
What you can do, though, is to retrieve individual predictions via predict.all = TRUE and then manually compute class labels on subsets of the whole forest. This you can then input into a function to compute ROC curves like those from the ROCR package.
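For instance, a rough sketch of that idea (not from the original answer): take the per-tree labels in rf_pred$individual from the block above, average the votes of the first k trees into a pseudo-probability for the positive class, and feed that into ROCR. The sub-forest size k and the use of "Good" as the positive class are illustrative assumptions.
library(ROCR)
# Assumes rf_pred was created with predict.all = TRUE as above and that
# rf_pred$individual holds the per-tree class labels ("Good"/"Bad")
k <- 5                                  # illustrative sub-forest size
votes <- rf_pred$individual[, 1:k]      # per-tree predicted labels
prob_good <- rowMeans(votes == "Good")  # fraction of trees voting "Good"
plot(performance(prediction(prob_good, y), "tpr", "fpr"))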
Edit: OK, from the link you provided in your comment I got the idea of how a ROC curve can be obtained. First, we need to extract one particular tree and then push each data point through the tree, counting the occurrences of the success class at each terminal node as well as the total number of data points in each node. The ratio gives the node's probability for the success class. Next, we do something similar, i.e. push each data point through the tree again, but now record that probability for each point. This way we can compare the class probabilities with the true labels.
Here is the code:
# libraries we need
library(caret)          # for the GermanCredit example data
library(randomForest)
library(ROCR)
# Set fixed seed for reproducibility
set.seed(54321)
# Define a function that walks a single tree (as returned by getTree) and
# returns the terminal node a given data row ends up in
travelTree = function(tree, data_row) {
    node = 1
    while (tree[node, "status"] != -1) {
        split_value = data_row[, tree[node, "split var"]]
        # randomForest convention: values <= split point go to the left daughter
        if (split_value <= tree[node, "split point"]) {
            node = tree[node, "left daughter"]
        } else {
            node = tree[node, "right daughter"]
        }
    }
    return(node)
}
# define number of data rows
nrows = 100
ntree = 11
# load example data
data(GermanCredit)
# Easier access of variables
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# Build RF model
rf_model = randomForest(x = x, y = y, ntree = ntree, nodesize = 10)
# Extract single tree and add variables we need to compute class probs
single_tree = getTree(rf_model, k = 2, labelVar = TRUE)
single_tree$"split var" = as.character(single_tree$"split var")
single_tree$sum_good = 0
single_tree$sum = 0
single_tree$pred_prob = 0
for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree$sum_good[out_node] = single_tree$sum_good[out_node] + (y[zeile] == "Good")
    single_tree$sum[out_node] = single_tree$sum[out_node] + 1
}
# Compute class probabilities from count of "Good" data points in each node.
# Make sure we do not divide by zero
idcs = single_tree$sum != 0
single_tree$pred_prob[idcs] = single_tree$sum_good[idcs] / single_tree$sum[idcs]
# Compute prediction by inserting again data set into tree, but read out
# previously computed probs
single_tree_pred = rep(0, nrow(x))
for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree_pred[zeile] = single_tree$pred_prob[out_node]
}
# Et voila: The ROC curve for single tree!
plot(performance(prediction(single_tree_pred, y), "tpr", "fpr"))

Obtaining confidence interval for npreg as values, not as plot

I am using the well known "np" package of Hayfield & Racine for non-parametric regressions. It allows plotting confidence bands for the estimated coefficient based on bootstrap procedures. See the code below for an example.
Question: I am wondering how to obtain these confidence intervals in numerical form. One, but not the only, reason for this question is that I really don't like the presentation of the CIs. More generally, I would like to use and further process the confidence band within my analysis.
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
plot(mynp, plot.errors.method="bootstrap")
When executing plot(), R dispatches to the np package's plot method, which is the function npplot.
npplot accepts an argument plot.behavior, which is "plot" by default: it plots the results and returns NULL. Set plot.behavior = "plot-data" and the function will both plot and return the data behind the plot.
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then the values on the line can be accessed through dat$r1$mean, and the values to be added to the mean to get the upper and lower confidence limits through dat$r1$merr.
Note that not all values are plotted, only half of them (every other value, plus the last one).
Read the help for npplot for more options.
Below is an example of using this, together with the results:
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then recreating the results:
z <- unlist(dat$r1$eval,use.names = F)
CI.up = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,2])
CI.dn = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,1])
plot(dat$r1$mean~z, cex=1.5,xaxt='n', ylim=c(1.0,3.5),xlab='',ylab='lalala!', main='blahblahblah',col='blue',pch=16)
arrows(z,CI.dn,z,CI.up,code=3,length=0.2,angle=90,col='red')
we will get the same kind of plot as before (the image is omitted here). As you can see, the results are the same (only I have calculated the intervals for each point and not only for half of them).
Note the plot.errors.type argument of npplot, which takes "standard" or "quantiles" and defaults to "standard". With "standard", dat$r1$merr holds the standard errors and the plot shows mean +/- std. err. as intervals. With "quantiles", the plot uses quantiles as the intervals and those quantiles are saved in dat$r1$merr. Which quantiles to use is specified by plot.errors.quantiles, and this is only relevant if plot.errors.type = "quantiles".
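For instance, a quantile-based version of the call above might look like this (a sketch reusing the mynp object from before, with the default 2.5%/97.5% quantiles made explicit):
dat_q <- plot(mynp, plot.errors.method = "bootstrap",
              plot.errors.type = "quantiles",
              plot.errors.quantiles = c(0.025, 0.975),
              plot.behavior = "plot-data")
# merr now holds the bootstrap quantiles rather than standard errors
CI.up <- as.numeric(dat_q$r1$mean) + as.numeric(dat_q$r1$merr[, 2])
CI.dn <- as.numeric(dat_q$r1$mean) + as.numeric(dat_q$r1$merr[, 1])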

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code returns an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all other variables as well.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
library(plm)
library(plyr)

# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm + 0

# Create 1 month time lag for TRx
alldata <- ddply(
    alldata, .(DMA), transform,
    # This assumes that the data is sorted
    Mkt_TRx_norm_prev = c(NA, Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)

# Create adstock function and obtain SE of regression
adstockreg <- function(decay, period, data_vector, pool_vector = 0){
    data_vector <- alldata$var
    pool_vector <- alldata$DMA
    data2 <- data_vector
    l <- length(data_vector)
    # if no pool, apply zero to vector
    if (length(pool_vector) == 1) pool_vector <- rep(0, l)
    # outer loop: extract data to decay from observation i
    for (i in 1:l){
        x <- data_vector[i]
        # inner loop: apply decay onto following observations after i
        for (j in 1:min(period, l)){
            # constrain decay to same pool (if data is pooled)
            if (pool_vector[i] == pool_vector[min(i + j, l)]) {
                data2[(i + j)] <- data2[(i + j)] + (x * (decay)^j)
            }
        }
    }
    # reduce length of edited data to equal length of initial data
    data2 <- data2[1:l]
    # regression - excludes NA values
    alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
    var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2,
                  data = alldata, model = "within", na.action = na.exclude)
    se <- summary(var_fe)$coefficients["data2", "Std. Error"]
    return(se)
}

# Optimize decay for adstock variable
result <- optimize(adstockreg, interval = c(0, 1), period = 6)
print(result)
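For reference, the core decay idea can be expressed compactly as a recursive filter; this is only an illustration of a (non-windowed, non-pooled) geometric adstock on a toy vector, not a replacement for the windowed, pooled version above:
# Geometric adstock: adstock[t] = x[t] + decay * adstock[t-1]
adstock <- function(x, decay) {
    as.numeric(stats::filter(x, filter = decay, method = "recursive"))
}
adstock(c(100, 0, 0, 0, 0), decay = 0.5)
# 100.00 50.00 25.00 12.50 6.25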
