I have a set of estimated parameters for an Ex-gaussian curve (i.e. mu, sigma, tau).
Currently I'm creating a visualization of that distribution by simulating data based on those parameters and plotting them in ggplot.
I would rather create a visualization that is effectively a smooth fitted ex-Gaussian curve - i.e. the curve implied by the parameters I've estimated - so that curves with the same parameters never appear differently because of simulation noise.
Here is the current simulation approach I'm utilizing:
library(retimes)
library(ggplot2)
g <- rexgauss(1000, mu = 1, sigma = 1, tau = 1)
g <- as.data.frame(g); colnames(g) <- "obs"
ggplot(g) + geom_density(aes(x = obs), size = 1, alpha = .4)
You can use stat_function from ggplot2. It takes a function in fun and parameters to pass to that function in args. It works well for situations like this, where you want to compare a simulation to a calculated distribution, because the function is evaluated over the x values you already supply to aes, so you don't have to match them up or work out the range of x values in your simulation yourself.
Here's an example with retimes::rexgauss. I also simplified your data frame creation, and put the parameters in a named list so you can use them in both the simulation and the calculated function.
My laptop is too slow to do all 1000 observations, so yours is probably smoother and closer to the calculated distribution than mine.
library(ggplot2)
exgauss_params <- list(mu = 1, sigma = 1, tau = 1)
exgauss_sim <- data.frame(obs = do.call(retimes::rexgauss, c(list(n = 100), exgauss_params)))
ggplot(exgauss_sim, aes(x = obs)) +
geom_density(aes(color = "simulated")) +
stat_function(aes(color = "calculated"),
fun = retimes::dexgauss, args = exgauss_params)
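If you only want the smooth fitted curve with no simulated data at all, you can skip the simulation and give ggplot just an x range to evaluate the density over. A minimal sketch, reusing retimes::dexgauss as above; the 0-10 range is an arbitrary choice for illustration:
library(ggplot2)
exgauss_params <- list(mu = 1, sigma = 1, tau = 1)
# No simulated data needed: supply only the x range over which to draw the curve
ggplot(data.frame(x = c(0, 10)), aes(x)) +
  stat_function(fun = retimes::dexgauss, args = exgauss_params) +
  labs(x = "obs", y = "density")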
Is it possible to filter a trend like this:
set.seed(1)
n=1000
mu = c(rep(1,100),rep(3,100),rep(5,100),rep(2,100),rep(1,600))
y = mu + rnorm(n)
and then obtain a numerical vector that defines the estimated trend?
It would be ideal if different threshold values could also be used.
It sounds like you are looking for a smoothing function. There are many ways to achieve this: for example a rolling average, loess, or generalized additive models. If you want the trend to be composed of straight-line sections, as in your example, you could try a regression on B-splines of degree 1.
This little function would perform such a task:
library(splines)
smoother <- function(x, n = floor(length(x) / 10), deg = 1) {
  predict(lm(x ~ bs(seq_along(x), knots = seq(1, length(x), n), degree = deg)))
}
The x argument is the data from which you are trying to find the trend, and n is the number of measurements between knots (that is, the points where the gradient of the line can change). deg is the degree of the polynomial used (1 for straight line segments, and higher numbers for smoother polynomial fits).
Trying this on your example, we would get something like this:
plot(y, type = 'l')
trend <- smoother(y, 50)
lines(trend, col = 'red')
Or if you wanted a less jagged line:
plot(y, type = 'l')
lines(smoother(y, 75, 4), col = 'red')
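If you would rather have a smoothly curved trend than piecewise-polynomial segments, loess (mentioned above) is a simple alternative. A minimal sketch on the same y, where span controls how aggressively it smooths:
plot(y, type = 'l')
lines(predict(loess(y ~ seq_along(y), span = 0.2)), col = 'blue')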
The following simulated example replicates the problem I have with a bigger, more complex data set.
library(marginaleffects)
library(truncnorm)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
summary(i)
I have used the predictions function from the marginaleffects package to see the predicted yields (yield_kgha) for a range of nitrogen rates (n_kgha) from my regression model (i). My original data only has n_kgha rates ranging from approximately 40-250, so the code below lets me see predicted yields at n_kgha rates not actually in my data set.
p <- predictions(
i,
newdata = datagrid(model = i, n_kgha = seq(0, 300, by = 25), grid_type = "counterfactual"))
summary(p, by = "n_kgha")
I would like to plot the response of yield conditioned on n_kgha ranging from 0-300, which includes n_kgha values not in my original data set. I have tried to do this using the plot_cap function.
plot_cap(i, condition = "n_kgha")
However, since my original data only has n_kgha rates ranging from 40-250, I am not getting my desired result of seeing the response curve over the full n_kgha (0:300) range. When I plot using the plot_cap function, I get a response curve with n_kgha ranging only from 40-250 (the min and max of the original data set).
Is there a way to run the plot_cap function based on the counterfactual range of n_kgha as used in the predictions function? Or should I use another method to plot the predicted values based on counterfactual values?
The plot_cap() function only plots over the observed range of values. However, since predictions() returns a “tidy” data frame, it is trivial to use the output of this function with ggplot2 to get the plot you need.
Note that we do not need to specify grid_type = "counterfactual". That option does something very specific and usually unwanted here: it duplicates the whole dataset once for every user-supplied value. It is only useful in very specific corner cases, not when you just want to make predictions over unobserved values of the predictors. See the documentation with ?datagrid.
Here’s a simple example of predictions() with ggplot2 to achieve what you want:
library(marginaleffects)
library(truncnorm)
library(ggplot2)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
p <- predictions(i, newdata = datagrid(n_kgha = seq(0, 300, 10)))
ggplot(p, aes(n_kgha, predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
geom_line() +
theme_minimal()
I am attempting to simulate data that approximates rate data - that is, count data that generally fit a negative binomial distribution but also involve an offset term for survey effort.
I think I can simulate counts well using the negative-binomial function (rnbinom()), but I then don't have a way to account for the offset term, which varies randomly from survey to survey. Put another way:
What is the best way to simulate non-integer rate data?
Is there a realistic way to simulate the offset term values?
Do I need to use a different distribution beyond negative-binomial to generate a realistic range of non-integer values?
Background: Our surveys measure counts of individuals per unit of survey effort (time), and the resulting rate is a non-negative, generally non-integer value (>= 0). The survey count data seem well modeled by a negative-binomial distribution, and in a GLM framework I would account for effort using survey time as an offset term. In the simulated data below, I generate a negative-binomial distribution to represent data within my actual survey. The offset term is simulated as a random uniform variable between 2-10 (the range of search times in minutes). Rate is then calculated as counts/time.
I plot histograms of both counts and rate to help demonstrate that rates here take many fractional values between integers. Because survey counts are often correlated with survey effort, it is critical that I ultimately use the rate data for analysis (i.e. figure 'B' below).
library(tidyverse)
theme_set(theme_classic())
d = data.frame(counts = rnbinom(n = 500,mu = 5,size = 1), ## dispersion parameter 'theta' set to 1
time = runif(500,2,10)) %>%
mutate(rate = counts/time)
## Count data histogram
ggplot(d, aes(x = counts))+
geom_histogram(fill = 'peachpuff',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('A: Histogram of Counts')
## Rate data histogram
ggplot(d, aes(x = rate))+
geom_histogram(binwidth = .1, fill = 'dodgerblue1',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
scale_y_continuous(expand = c(0,0))+
ylab('frequency')+
ggtitle('B: Histogram of Rate')
Below I can readily simulate counts from my original survey data, but I don't know how to properly simulate the data back as a rate. For example, if I fit an intercept-only negative-binomial GLM, I can use the coefficient to simulate new negative-binomial distributions of counts that are very similar to the original data (i.e. using a similar value for 'mu').
[I realize this seems circular in this example, but this is my approach with real data. First describe the mean value and dispersion 'theta' with a GLM, then simulate back datasets that mimic my original dataset]
I use this approach below both to simulate back count data and, by fitting a model with the offset term, to simulate back a distribution whose mean matches the mean 'rate' from figure 'B'.
### Simulate back count data from the original survey data:
##describe mean value 'mu' by finding intercept
## 'theta' could also be calculated
m1 = MASS::glm.nb(counts ~ 1, data = d)
# summary(m1)
# mean(d$counts)
# exp(m1$coefficients[1])
## simulated negative-binomial distribution using calculated 'mu'
d.sim = data.frame(new.counts = rnbinom(500,
mu = as.numeric(exp(m1$coefficients[1])), ## coef on log-scale, exponentiate to use
size = 1)) ## holding dispersion parameter 'theta' constant at 1
## Plot and compare with plot 'A' above
ggplot(d.sim, aes(x = new.counts))+
geom_histogram(fill = 'peachpuff3',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('C: Simulated Counts')
###########################################
###########################################
### Simulate back 'rate' data by including an offset term for effort in the GLM model
## the exponentiated coefficient should equal the mean of the raw rate data
m2 = MASS::glm.nb(counts ~ 1 + offset(log(time)), data = d)
# summary(m2)
# mean(d$rate)
# exp(m2$coefficients[1])
d.sim.2 = data.frame(new.counts = rnbinom(500,
mu = as.numeric(exp(m2$coefficients[1])), ## coef on log-scale, exponentiate to use
size = 1)) ## holding dispersion parameter 'theta' constant at 1
## compare these simulated 'rate' data with the non-integer 'true rate' data in figure D
ggplot(d.sim.2, aes(x = new.counts))+
geom_histogram(binwidth = .1, fill = 'dodgerblue3',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('D: Simulated Rate')
So at this point I've generated figure 'C' as a simulated dataset representing counts that I have observed in real life, which closely matches the original data in figure 'A'. The 'rate' data in figure 'D' are (necessarily) all integer values drawn from rnbinom(), and while the mean of figure 'D' is approximately equal to the mean of figure 'B', my sense is that these two distributions are not really equivalent.
So my questions again:
Is there a way that I could instead simulate data to match figure 'B' (non-integer rate data)?
Do you think that the data in figure 'D' will work as an approximation to 'B', since the mean values (and dispersion) are similar?
For additional context, I'll be using the simulated datasets (many of them) to run other Monte-Carlo type simulation analysis (e.g. power analysis). I'm worried that if I use data generated in Figure 'D', it won't really represent what my actual survey data will be (figure 'B').
The way you generate your sample data (in place of your empirical data) does not align with the data-generating process you describe. The counts from rnbinom(n = 500, mu = 5, size = 1) do not depend on time at all; mu needs to be a function of the time variable, or else the counts are independent of survey effort.
Also, note that size is the dispersion parameter: the variance of the negative binomial is mu + mu^2/size, so size = 1 implies strong overdispersion, and the distribution only approaches a Poisson (a limiting case of the negative binomial) as size becomes very large. Given your description of the DGP, overdispersion in the empirical data sounds plausible, but you will probably want to estimate size (theta) from the data rather than fix it at 1.
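A quick way to see the effect of size (not part of the original question) is to compare sample variances against that mean-variance relationship:
set.seed(1)
x_nb   <- rnbinom(1e5, mu = 5, size = 1)    # size = 1: strongly overdispersed
x_pois <- rnbinom(1e5, mu = 5, size = 1e6)  # very large size: essentially Poisson
var(x_nb)    # ~30, close to 5 + 5^2 / 1
var(x_pois)  # ~5, close to the mean, as expected for a Poisson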
To answer your first question, see the code example below. Regarding your second question: no, I don't think that would be a good idea.
library(tidyverse)
library(rstanarm)
options(mc.cores = parallel::detectCores())
n <- 1000
empirical <-
tibble(
time = runif(n, 2, 10),
count = rnbinom(n = n, mu = time, size = 1) # Generate count data that actually depends on time
) |>
mutate(rate = count/time)
m_stan <- stan_glm.nb(count ~ time, data = empirical)
simulated <-
tibble(
time = runif(n, 2,10),
) %>%
mutate(
count = posterior_predict(m_stan, ., draws = 1) |>
as.vector(),
rate = count/time
)
d <- lst(simulated, empirical) |>
bind_rows(.id = "data")
d |>
select(data, count, rate) |>
pivot_longer(c(count, rate)) |>
ggplot() +
geom_histogram(aes(value), binwidth = .2) +
facet_grid(data ~ name, scales = "free")
The response variable in my dataset consists of observations Y[1], Y[2], ..., Y[49]. I came up with a Bayesian hierarchical model to make Bayesian predictions for Y[50]. I also have MCMC samples for Y[1], ..., Y[49], which I can use to assess the overall fit of my Bayesian model by comparing them with the actual values of Y[1], Y[2], ..., Y[49].
Is there any way, in R, to draw caterpillar plots of my Bayesian predictions from the MCMC object of the hierarchical model, along with points that stand for the actual observed Y's from my original dataset?
Thank you,
First you need to extract the confidence intervals for each $Y_i$ (usually this is done with the quantile function if you're not using a standard S3 object).
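For example, if your MCMC samples are stored as an iterations-by-49 matrix (a layout assumed here purely for illustration), the q1, q2 and estimate objects used below could be computed like this:
# `draws`: hypothetical matrix of MCMC samples, one column per Y[i]
q1       <- apply(draws, 2, quantile, probs = 0.025)  # lower bound of 95% interval
q2       <- apply(draws, 2, quantile, probs = 0.975)  # upper bound of 95% interval
estimate <- apply(draws, 2, median)                   # point estimate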
Then you create the following df:
df <- tibble(
  obs = 1:49,
  lower = q1,
  upper = q2,
  estimate = estimate,
  actual = actual)
Then you go:
df %>% ggplot(aes(x = obs)) +
geom_line(aes(y = actual)) +
geom_pointrange(aes(ymin = lower, ymax = upper, y = estimate)) +
coord_flip()
If you're doing hierarchical models, I really recommend the rstanarm package, which is compatible with the tidybayes library (and tidybayes makes caterpillar plots almost automatic).
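A minimal sketch of that route, assuming fit is a model fitted with rstanarm and dat is the original data with an outcome column y (all three names are placeholders):
library(dplyr)
library(ggplot2)
library(tidybayes)
dat %>%
  mutate(obs = row_number()) %>%
  add_predicted_draws(fit) %>%                  # posterior predictive draws, one row per draw per observation
  ggplot(aes(x = obs, y = .prediction)) +
  stat_pointinterval(.width = c(0.66, 0.95)) +  # caterpillar-style point + interval summaries
  geom_point(aes(y = y), colour = "red") +      # observed values
  coord_flip()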
My apologies if I'm missing something obvious. I've been thoroughly enjoying working with h2o over the last few days using the R interface. I would like to evaluate my model, say a random forest, by plotting an ROC curve. The documentation seems to suggest that there is a straightforward way to do that:
Interpreting a DRF Model
By default, the following output displays:
Model parameters (hidden)
A graph of the scoring history (number of trees vs. training MSE)
A graph of the ROC curve (TPR vs. FPR)
A graph of the variable importances
...
I've also seen that in Python you can apply the roc function. But I can't seem to find a way to do the same in the R interface. Currently I'm extracting predictions from the model using h2o.cross_validation_holdout_predictions and then using the pROC package from R to plot the ROC curve. But I would like to be able to do it directly from the H2O model object, or, perhaps, an H2OModelMetrics object.
Many thanks!
A naive solution is to use the plot() generic function on an H2OModelMetrics object:
logit_fit <- h2o.glm(colnames(training)[-1], 'y', training_frame = training.hex,
                     validation_frame = validation.hex, family = 'binomial')
plot(h2o.performance(logit_fit, valid = TRUE), type = 'roc')
This will give us a plot:
But it is hard to customize, especially to change the line type, since the type parameter is already taken by 'roc'. I have also not found a way to plot multiple models' ROC curves together on one plot. So I came up with a method to extract the true positive rate and false positive rate from the H2OModelMetrics object and plot the ROC curves on one plot myself with ggplot2. Here is the example code (it uses a lot of tidyverse syntax):
library(tidyverse)  # map, map2, reduce, add_row, add_column, ggplot2

# for example I have 4 H2OModels
list(logit_fit, dt_fit, rf_fit, xgb_fit) %>%
  # map a function to each element in the list
  map(function(x) x %>% h2o.performance(valid = TRUE) %>%
        # from all these 'paths' in the object
        .@metrics %>% .$thresholds_and_metric_scores %>%
        # extract true positive rate and false positive rate
        .[c('tpr', 'fpr')] %>%
        # add (0,0) and (1,1) as the start and end points of the ROC curve
        add_row(tpr = 0, fpr = 0, .before = 1) %>%
        add_row(tpr = 1, fpr = 1)) %>%
# add a column of model name for future grouping in ggplot2
map2(c('Logistic Regression','Decision Tree','Random Forest','Gradient Boosting'),
function(x,y) x %>% add_column(model=y)) %>%
# reduce four data.frame to one
reduce(rbind) %>%
# plot fpr and tpr, map model to color as grouping
ggplot(aes(fpr,tpr,col=model))+
geom_line()+
geom_segment(aes(x=0,y=0,xend = 1, yend = 1),linetype = 2,col='grey')+
xlab('False Positive Rate')+
ylab('True Positive Rate')+
ggtitle('ROC Curve for Four Models')
Then the ROC curve is:
You can get the ROC curve by passing the model performance metrics to the plot() function.
A shortened code snippet, which assumes you have created a model (call it glm) and split your dataset into train and validation sets:
perf <- h2o.performance(glm, newdata = validation)
plot(perf)
full code snippet below:
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
glm = h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
              training_frame = prostate.hex, family = "binomial",
              nfolds = 0, alpha = 0.5, lambda_search = FALSE)
perf <- h2o.performance(glm, newdata = prostate.hex)
plot(perf)
and this will produce the following:
There is not currently a function in the H2O R or Python client to plot the ROC curve directly. The roc method in Python returns the data necessary to plot the ROC curve, but does not plot the curve itself. ROC curve plotting directly from R and Python seems like a useful thing to add, so I've created a JIRA ticket for it here: https://0xdata.atlassian.net/browse/PUBDEV-4449
The reference to the ROC curve in the docs refers to the H2O Flow GUI, which will automatically plot a ROC curve for any binary classification model in your H2O cluster. All the other items in that list are in fact available directly in R and Python, however.
If you train a model in R, you can visit the Flow interface (e.g. localhost:54321) and click on a binomial model to see its ROC curves (training, validation and cross-validated versions). It will look like this:
Building off Lauren's example, after you run h2o.performance() you can extract all the information needed for ggplot from perf@metrics$thresholds_and_metric_scores. This code produces the ROC curve, but you can also add precision and recall to the selected variables to plot a PR curve.
Here is some example code using the same model as above.
library(h2o)
library(dplyr)
library(ggplot2)
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.importFile(
path = prostatePath,
destination_frame = "prostate.hex"
)
glm <- h2o.glm(
y = "CAPSULE",
x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = prostate.hex,
family = "binomial",
nfolds = 0,
alpha = 0.5,
lambda_search = FALSE
)
# Model performance
perf <- h2o.performance(glm, newdata = prostate.hex)
# Extract info for ROC curve
curve_dat <- data.frame(perf@metrics$thresholds_and_metric_scores) %>%
select(c(tpr, fpr))
# Plot ROC curve
ggplot(curve_dat, aes(x = fpr, y = tpr)) +
geom_point() +
geom_line() +
geom_segment(
aes(x = 0, y = 0, xend = 1, yend = 1),
linetype = "dotted",
color = "grey50"
) +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("ROC Curve") +
theme_bw()
Which produces this plot:
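For a quick look without ggplot2, the same thresholds_and_metric_scores table can also be drawn with base graphics; a minimal sketch, reusing perf from the code above:
roc_dat <- data.frame(perf@metrics$thresholds_and_metric_scores)
plot(roc_dat$fpr, roc_dat$tpr, type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC Curve")
abline(0, 1, lty = 2, col = "grey50")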