Create ROC curve manually from data frame - r

I have the below conceptual problem which I can't get my head around.
Below is an example for survey data where I have a time column that indicates how long someone needs to respond to a certain question.
Now, I'm interested in how the amount of cleaning would change depending on the time threshold I cut at, i.e. what would happen if I increase the threshold and what would happen if I decrease it.
So my idea was to just create a ROC curve (or other model metrics) to have a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of just looping through my data at maybe 100 different thresholds, calculate false and true positive rates at each threshold and then do a simple line plot, but I was hoping for a more elegant solution that doesn't require me to loop.
Any ideas?
example data:
time column indicates the time needed per case
truth column indicates my current decision I want to compare against
predicted column indicates the cleaning decision if I cut at a time threshold of 2.5s. This is what I need to change/loop through.
library(dplyr)  # for %>%, mutate() and if_else()
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

You can also use ROCR for this:
library(ROCR)
library(dplyr)  # for %>%, mutate() and if_else()
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
pred <- prediction(df$time, df$truth)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
[1] 0.92
Cross checking the AUC value with pROC
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
Both packages give the same AUC.

So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5)) |>
roc(truth, time) |>
plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you have only 10 observations. The sensible cutoffs are the nine values situated in between your sorted time values. You can get those from roc:
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
What exactly counts as "looping" needs a precise definition: do you want to avoid only explicit for and while loops, or anything you would consider a loop? Is c(1, 2, 3, 4) * 5 a loop? Either way, there will be a loop running under the hood.
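As a rough illustration of the midpoint cutoffs described above, here is a minimal base-R sketch that builds the curve by hand without an explicit for loop. It assumes "cleaned" is treated as the positive class and that a case is predicted "cleaned" when its time falls below the cutoff; the names cutoffs, is_pos, below, tpr, fpr and auc_manual are made up for this example.

```r
set.seed(3)
df <- data.frame(time = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

# Candidate cutoffs: -Inf, the midpoints between sorted unique times, and Inf
st <- sort(unique(df$time))
cutoffs <- c(-Inf, (head(st, -1) + tail(st, -1)) / 2, Inf)

# Assumption: "cleaned" is the positive class, predicted when time < cutoff
is_pos <- df$truth == "cleaned"
below <- outer(df$time, cutoffs, "<")  # rows = cases, columns = cutoffs

# True and false positive rates at every cutoff in one vectorised step
tpr <- colSums(below & is_pos) / sum(is_pos)
fpr <- colSums(below & !is_pos) / sum(!is_pos)

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
abline(0, 1, lty = 2)

# Area under the curve by the trapezoidal rule
auc_manual <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc_manual
```

With the seed used in the question this should agree with the AUC reported by ROCR and pROC. outer() still loops internally, so this is "loop-free" only in the sense of having no explicit for or while.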

Related

How to accomplish replicated calculation and plot in subset dataset?

I have a simulated data created like this:
library(MASS)  # provides mvrnorm()
average_vector = c(0,0,25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9),nrow=3,ncol=3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n =20000, mu = average_vector, Sigma=sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub <- replicate(1000, sample(1:nrow(data0), 50))
After that step I ran into an issue: I want to fit a regression model on each set of 50 selected people (1,000 fits in total) and store the point estimate of "hard" from model 4, which is specified as
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
I then want to plot the variation of those estimates around the line at 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) tidymodels if you're just getting into R modelling. Either of these will make your life easier once you get used to the dplyr-like syntax.
What you're trying to do is close to bootstrapping (strictly speaking, repeated subsampling without replacement). Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000
ns <- 50
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
# loop over the k samples (not over the n rows of data0)
for(i in seq_len(k)){
  params[[i]] <- coef( lm(formula = final ~ hard + smartness + age, data = data0[samples[, i],]) )
}
# merge params into columns
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.

Is there a way to simulate time series data with a specific rolling mean and autocorrelation in R?

I have an existing time series (1000 samples) and calculated the rolling mean using the filter() function in R, averaging across 30 samples each. The goal of this was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, that would result in the same rolling mean if I would apply the same filter() function to the artificial data. In short, I would like to simulate a time series with the same overall course but not the exact same values as those of an existing time series. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend are not the same.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){filter(x, rep(1 / n, n), sides = 2, circular = T)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problem is that a) the variance of the simulated values is higher than the variance of the original values, b) that the rolling average, although quite similar to the original, sometimes deviates quite a bit from the original, and c) that there is no autocorrelation. To have an autocorrelational structure in the data would be good since it is supposed to resemble the original data.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
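One possible direction for points a) and c) above, offered as a sketch rather than a tested answer: keep the rolling mean as the trend and add AR(1) noise whose lag-1 autocorrelation and standard deviation are estimated from the residuals around that trend. The names roll_mean, trend, res, phi, innov_sd, noise and sim are made up for this example, and the assumption that the residuals are roughly AR(1) may not hold for real data.

```r
set.seed(576)
x <- as.numeric(arima.sim(model = list(order = c(1, 0, 0), ar = 0.9),
                          n = 1000)) + 900

# Same circular two-sided rolling mean as in the question
roll_mean <- function(x, n = 30) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 2, circular = TRUE))
}

trend <- roll_mean(x)
res <- x - trend  # fluctuations around the trend

# Lag-1 autocorrelation of the residuals (assumed AR(1) structure)
phi <- acf(res, lag.max = 1, plot = FALSE)$acf[2]

# Innovation sd chosen so the AR(1) noise has about the same sd as the residuals
innov_sd <- sd(res) * sqrt(1 - phi^2)
noise <- as.numeric(arima.sim(model = list(ar = phi), n = length(x),
                              sd = innov_sd))

sim <- trend + noise

plot(x, type = "l")
lines(sim, col = "blue")
lines(roll_mean(sim), col = "green")
lines(trend, col = "red")
```

The rolling mean of sim will still deviate somewhat from the original one (problem b), because the smoothed noise is not exactly zero.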

value at risk estimation using fGarch package in R

I am trying to make a similar analysis to McNeil & Frey in their paper 'Estimation of tail-related risk measures for heteroscedastic financial time series: an extreme value approach' but I am stuck with a problem when implementing the models.
The approach is to fit an AR(1)-GARCH(1,1) model in order to estimate the one-day-ahead forecast of the VaR using a window of 1000 observations.
I have simulated data that should work fine with my model, and I assume that if I were doing this correctly, the observed coverage rate should be close to the theoretical one. However, it is always below the theoretical coverage rate, and I don't know why.
I believe that this is how the calculation of the estimated VaR is done:
VaR_hat = mu_hat + sigma_hat * qnorm(alpha)
but I might be wrong. I have tried to find related questions here at stack but I have not found any.
How I approach this can be summarized in three steps.
Simulate 2000 AR(1)-GARCH(1,1) observations and fit a corresponding model and extract the one day prediction of the conditional mean and standard deviation using a window of 1000 observations.(Thereby making 1000 predictions)
Use the predicted values and the normal quantile to calculate the VaR for the wanted confidence level.
Check if the coverage rate is close to the theoretical one.
If someone could help me I would be extremely thankful, and if I'm unclear in my formulation please just tell me and I'll try to come up with a better explanation of the problem.
The code I'm using is attached below.
Thank you in advance
library(fGarch)
nObs <- 2000 # Number of observations.
quantileLevel <- 0.95 # Since we expect 5% exceedances.
from <- seq(1,1000) # Lower index vector for observations in model.
to <- seq(1001,2000) # Upper index vector for observations in model.
VaR_vec <- rep(0,(nObs-1000)) # Empty vector for storage of 1000 VaR estimates.
# Specs for simulated data (including the AR(1) component and all components for GARCH(1,1)).
spec = garchSpec(model = list(omega = 1e-6, alpha = 0.08, beta = 0.91, ar = 0.10),
cond.dist = 'norm')
# Simulate nObs (2000) data points.
data_sim <- c(garchSim(spec, n = nObs, n.start = 1000))
for (i in 1:1000){
# The rolling window of 1000 observations.
data_insert <- data_sim[from[i]:to[i]]
# Fitting an AR(1)-GARCH(1,1) model with normal cond.dist.
fitted_model <- garchFit(~ arma(1,0) + garch(1,1), data_insert,
trace = FALSE,
cond.dist = "norm")
# One day ahead forecast of conditional mean and standard deviation.
predict(fitted_model, n.ahead = 1)
prediction_model <- predict(fitted_model, n.ahead = 1)
mu_pred <- prediction_model$meanForecast
sigma_pred <- prediction_model$standardDeviation
# Calculate VaR forecast
VaR_vec[i] <- mu_pred + sigma_pred*qnorm(quantileLevel)
if (length(to)-i != 0){
print(c('Countdown, just',(length(to) - i),'iterations left'))
} else {
print(c('Done!'))
}
}
# Extract only the values related to the forecasts.
compare_data_sim <- data_sim[1001:length(data_sim)]
hit <- rep(0,length(VaR_vec))
# Count the amount of exceedances.
for (i in 1:length(VaR_vec)){
hit[i] <- sum(VaR_vec[i] <= compare_data_sim[i])
}
plot(data_sim[1001:2000], type = 'l',
ylab = 'Simulated data', main = 'Illustration of one day ahead prediction of 95%-VaR')
lines(VaR_vec, col = 'red')
cover_prop <- sum(hit)/length(hit)
print(sprintf("Diff theoretical level and VaR coverage = %f", (1-quantileLevel) - cover_prop))
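One thing that may contribute to the mismatch, offered as an observation rather than a confirmed diagnosis: the window from[i]:to[i] runs from i to 1000 + i, so it holds 1001 observations and already contains data_sim[1000 + i], which is the value VaR_vec[i] is later compared against. A one-step-ahead forecast from that window would target observation 1001 + i instead. The sketch below only checks the index arithmetic for an aligned rolling window; n_obs, window, from_idx, to_idx and target_idx are hypothetical names, and no GARCH model is fitted here.

```r
n_obs <- 2000
window <- 1000

# Window i covers observations from_idx[i]:to_idx[i], exactly `window` long
from_idx <- seq_len(n_obs - window)  # 1, ..., 1000
to_idx <- from_idx + window - 1      # 1000, ..., 1999

# The one-step-ahead forecast from window i targets the next observation
target_idx <- to_idx + 1             # 1001, ..., 2000

# Sanity checks: fixed window length, and the target is never inside its window
stopifnot(all(to_idx - from_idx + 1 == window))
stopifnot(!any(target_idx >= from_idx & target_idx <= to_idx))

head(cbind(from_idx, to_idx, target_idx), 3)
```

Under this alignment, the forecast from window i would be compared against data_sim[target_idx[i]].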

Monte Carlo simulation in R

I am trying to simulate data (Y) from an AR(1) model with rho = 0.7. I will then regress Y on an intercept only (so that the parameter estimate is the mean of Y) and test the null hypothesis that the coefficient is less than or equal to zero (the alternative being that it is greater than zero) using robust standard errors.
I want to run a Monte Carlo simulation of this test using 2000 replications for different lag values. The purpose is to show the finite-sample performance of the Newey-West estimator as the lag changes. This is how I began:
A<-array(0, dim=c(2000,1))
for(i in 1:2000){
y_new<-arima.sim(model=list(ar=0.7), n=50, mean=0,sd=1)
reg<-lm(y_new~1)
ad<-coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag=1, prewhite=FALSE))
A[i]<-ad[,3]
}
My question: is the code above the right way of doing this kind of simulation? And if it is, how can I repeat this process for different lag values in the HAC test? I want to run the test repeatedly, increasing the lag by 1 each time, so I will be doing this 50 times for lags 1, 2, 3, ..., 50, each time storing the 2000 simulated test statistics in a separately named vector. I then want to calculate the rejection probability of the test (significance level 0.05, using the critical value of 1.645) for each lag and plot these rejection probabilities against the lag values.
Please help
Because you didn't mention the purpose of the simulation, it is hard to tell whether it is the right way.
You save a lot of time by computing all 50 test statistics on each simulated sample, instead of repeating the simulation 2000 times for each lag (which would mean 2000*50 simulated samples).
A better way to structure the simulation is
library(AER)
library(dplyr)
lags <- 1:50
nreps <- 2000
sim <- function (){
ynew <- arima.sim(model = list(ar=0.7), n=50, mean=0, sd=1)
reg <- lm(ynew ~ 1 )
s <- rep(NA, 50)
for(i in lags){
ad <- coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag = i, prewhite=FALSE))
s[i] <- ad[ ,4]
}
s
}
The following code stores the simulation results in a data.frame:
result <- lapply(1:nreps, function(i) data.frame(simulation = i, lag = lags, pvalues = sim())) %>%
  bind_rows()  # bind_rows() replaces the defunct rbind_all()
From your description, I assume that what you want looks something like
library(ggplot2)
result %>%
group_by(lag) %>%
summarize(rejectfreq = mean(pvalues < 0.05)) %>%  # reject when p < 0.05
ggplot(., aes(lag, rejectfreq)) + geom_line()+
coord_cartesian(ylim = c(0,1)) +
scale_y_continuous(breaks=seq(0, 1, by=0.1))
Although the figure was created using only 100 simulations, it is evident that the choice of the lags in Newey-West wouldn't matter much when the disturbance terms are i.i.d.
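The question asks for a one-sided test against the critical value 1.645, whereas coeftest() reports two-sided p-values by default. The sketch below works with the t statistics directly. To stay self-contained it uses a hand-rolled Bartlett-kernel variance for the sample mean as a stand-in for NeweyWest(); nw_var_mean, tstats and reject are names made up for this example, and the result is not guaranteed to match the sandwich package to the last digit. The number of lags and replications is reduced to keep it quick.

```r
# Bartlett-kernel (Newey-West style) variance of the sample mean of y.
# Hypothetical helper, not part of any package.
nw_var_mean <- function(y, lag) {
  n <- length(y)
  e <- y - mean(y)
  lag <- min(lag, n - 1)
  # Autocovariances gamma_0, ..., gamma_lag with divisor n
  gam <- sapply(0:lag, function(j) sum(e[(j + 1):n] * e[1:(n - j)]) / n)
  w <- 1 - (0:lag) / (lag + 1)  # Bartlett weights
  lrv <- gam[1] + 2 * sum(w[-1] * gam[-1])
  lrv / n
}

set.seed(1)
lags <- 1:10
nreps <- 200

# One column per simulated sample, one row per lag
tstats <- replicate(nreps, {
  y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 50))
  sapply(lags, function(l) mean(y) / sqrt(nw_var_mean(y, l)))
})

# One-sided rejection frequency at the 5% level for each lag
reject <- rowMeans(tstats > 1.645)

plot(lags, reject, type = "b", ylim = c(0, 1),
     xlab = "lag", ylab = "rejection frequency")
abline(h = 0.05, lty = 2)
```

Since the true mean is zero, the dashed line at 0.05 marks the nominal size the rejection frequencies would ideally reach.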

Are ROC curve calculations something that can be split up for parallel processing?

In R (or any other platform / language), is the calculation of a ROC curve something that can be split up and performed in parallel? I'm doing one with 150k lines and it takes about 5-7 minutes for each calculation using the pROC package. Any other suggestions for quicker AUC or ROC calculations would be appreciated. Thanks.
The calculation of an ROC curve should be quite fast since it is really just sorting results and calculating a cumulative sum of proportions, but my guess is that you are doing something more complex (or you are doing it in a very inefficient manner). This illustrates construction of an ROC curve for 15000 points; it is almost instantaneous (doing it with 150K did slow it down a bit, but it was still under 2 seconds):
testres <- data.frame(res=rnorm(15000), cat=rbinom(15000,1, .2))
require(ROCR)
pred <- prediction(testres$res, testres$cat)
perf <- performance(pred,"tpr","fpr")
plot(perf)
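To illustrate why this is cheap: if only the AUC is needed, it can be obtained from a single ranking of the scores via the Mann-Whitney rank identity. This is a minimal base-R sketch, assuming labels coded 0/1 with 1 as the positive class; auc_rank is a name made up for this example.

```r
# AUC from ranks (Mann-Whitney identity); hypothetical helper
auc_rank <- function(score, label) {
  keep <- !is.na(score) & !is.na(label)
  score <- score[keep]
  pos <- label[keep] == 1
  # as.numeric() avoids integer overflow in n_pos * n_neg for large inputs
  n_pos <- as.numeric(sum(pos))
  n_neg <- as.numeric(sum(!pos))
  r <- rank(score)  # average ranks, so ties count as one half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
testres <- data.frame(res = rnorm(150000), cat = rbinom(150000, 1, 0.2))
auc_random <- auc_rank(testres$res, testres$cat)
auc_random  # scores are pure noise here, so this should be near 0.5
```

The cost is dominated by one sort of the scores, which is why 150k rows should not take minutes.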
screechOwl. I am well aware that this post is from 6 years ago. I came across this when I was searching for more efficient method to compute AUC.
I have recently written a function that drastically reduces the time required to get the value of AUC, it also works with factors.
library(data.table)  # only needed for the factor/character branch

AUROC <- function(Target, Prediction){
  if(is.numeric(Prediction)){
    # Sort by prediction and drop missing predictions
    Order <- order(Prediction)
    Order <- Order[!Order %in% which(is.na(Prediction))]
    Prediction <- Prediction[Order]
    Target <- Target[Order]
    # Cumulative false and true positive rates
    FP <- cumsum(!Target)/sum(!Target)
    TP <- cumsum(Target)/sum(Target)
    # Keep only the last entry of each run of tied predictions
    Repeated <- c(Prediction[-1] == Prediction[-length(Prediction)], FALSE)
    FP <- c(0, FP[!Repeated], 1)
    TP <- c(0, TP[!Repeated], 1)
  } else if(is.factor(Prediction) || is.character(Prediction)){
    # Order the levels by their observed event rate
    DT1 <- data.table(Y = Target, X = Prediction)
    DT1 <- DT1[, list(Event = sum(Y),
                      `Non Event` = sum(!Y)),
               by = X]
    DT1[, Probability := Event / (Event + `Non Event`)]
    DT1 <- DT1[order(Probability), ]
    FP <- c(0, cumsum(DT1[, `Non Event`])/ sum(DT1[, `Non Event`]), 1)
    TP <- c(0, cumsum(DT1[, Event])/ sum(DT1[, Event]), 1)
  }
  # Trapezoidal rule over the (FP, TP) points
  N <- length(FP)
  TP_avg <- (TP[-1] + TP[-N])/ 2
  FP_dif <- FP[-1] - FP[-N]
  Area <- sum(TP_avg * FP_dif)
  # Report the area on the [0.5, 1] scale regardless of score direction
  Results <- 0.5 + abs(Area - 0.5)
  return(Results)
}
However, I would say this is mostly the contribution of John Mount.
(http://www.win-vector.com/blog/2016/10/on-calculating-auc/)
The part that I have added is essentially the ability to deal with factors. Note that with some modifications in the last few lines, you can get this function to plot ROC curve. My findings show that this outperforms the method using prediction(), performance() and plot(), up until a few million rows or so, probably due to higher overheads. Beyond that point, it is better to stick with the method suggested by "42-".
Hope this helps.

Resources