I am attempting to simulate data that approximate rate data - that is, count data that generally fit a negative binomial distribution but that also involve an offset term for survey effort.
I think I can simulate counts well with the negative-binomial function (rnbinom()), but I don't then have a way to account for the offset term, which will be random for each survey. Put another way:
What is the best way to simulate non-integer rate data?
Is there a realistic way to simulate the offset term values?
Do I need a distribution other than the negative binomial to generate a realistic range of non-integer values?
Background: Our surveys measure counts of individuals per unit of survey effort (time), and the resulting rate is a non-negative, generally non-integer value (>= 0). The survey count data seem well modeled by a negative-binomial distribution, and in a GLM framework I would account for effort by using survey time as an offset term. In the simulated data below, I generate a negative-binomial distribution to represent counts from my actual survey. The offset term is simulated as a random uniform variable between 2 and 10 (the range of search times in minutes). Rate is then calculated as counts/time.
I plot histograms of both counts and rate to help demonstrate that rates here take many fractional values between integers. Because survey counts are often correlated with survey effort, it is critical that I ultimately use the rate data for analysis (i.e. figure 'B' below).
library(tidyverse)
theme_set(theme_classic())
d = data.frame(counts = rnbinom(n = 500, mu = 5, size = 1), ## dispersion parameter 'theta' set to 1
               time = runif(500, 2, 10)) %>%
  mutate(rate = counts/time)
## Count data histogram
ggplot(d, aes(x = counts))+
geom_histogram(fill = 'peachpuff',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('A: Histogram of Counts')
## Rate data histogram
ggplot(d, aes(x = rate))+
geom_histogram(binwidth = .1, fill = 'dodgerblue1',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
scale_y_continuous(expand = c(0,0))+
ylab('frequency')+
ggtitle('B: Histogram of Rate')
Below I can readily simulate counts from my original survey data, but I don't know how to properly simulate back data as a rate. For example, if I fit an intercept-only negative-binomial GLM, I can use the coefficient to simulate new negative-binomial distributions of counts that are very similar to the original data (i.e., using a similar value for 'mu').
[I realize this seems circular in this example, but this is my approach with real data. First describe the mean value and dispersion 'theta' with a GLM, then simulate back datasets that mimic my original dataset]
I use this approach below both to generate count data and, by fitting a model with the offset term, to simulate back a distribution that has the mean 'rate' from figure 'B'.
### Simulate back count data from the original survey data:
##describe mean value 'mu' by finding intercept
## 'theta' could also be calculated
m1 = MASS::glm.nb(counts ~ 1, data = d)
# summary(m1)
# mean(d$counts)
# exp(m1$coefficients[1])
## simulated negative-binomial distribution using calculated 'mu'
d.sim = data.frame(new.counts = rnbinom(500,
                                        mu = as.numeric(exp(m1$coefficients[1])), ## coef on log-scale, exponentiate to use
                                        size = 1)) ## holding dispersion parameter 'theta' constant at 1
## Plot and compare with plot 'A' above
ggplot(d.sim, aes(x = new.counts))+
geom_histogram(fill = 'peachpuff3',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('C: Simulated Counts')
###########################################
###########################################
### Simulate back 'rate' data by including an offset term for effort in the GLM model
## the exponentiated coefficient should equal the mean of the raw rate data
m2 = MASS::glm.nb(counts ~ 1 + offset(log(time)), data = d)
# summary(m2)
# mean(d$rate)
# exp(m2$coefficients[1])
d.sim.2 = data.frame(new.counts = rnbinom(500,
                                          mu = as.numeric(exp(m2$coefficients[1])), ## coef on log-scale, exponentiate to use
                                          size = 1)) ## holding dispersion parameter 'theta' constant at 1
## compare these simulated 'rate' data (figure D) with the non-integer 'true rate' data in figure B
ggplot(d.sim.2, aes(x = new.counts))+
geom_histogram(binwidth = .1, fill = 'dodgerblue3',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('D: Simulated Rate')
So it is at this point that I've generated figure 'C' as a simulated dataset representing counts that I have observed in real life, which closely matches the original data in figure 'A'. The 'rate' data in figure 'D' are (necessarily) all integer values drawn from rnbinom(), and while the mean of figure 'D' approximates the mean of figure 'B', my sense is that these two distributions are not really equivalent.
So my questions again:
Is there a way that I could instead simulate data to match figure 'B' (non-integer rate data)?
Do you think the data in figure 'D' will work as an approximation of 'B', since the mean values (and dispersion) are similar?
For additional context, I'll be using the simulated datasets (many of them) to run other Monte-Carlo-type simulation analyses (e.g. power analysis). I'm worried that if I use data generated as in figure 'D', they won't really represent what my actual survey data will be (figure 'B').
The way you generate your sample data (in place of your empirical data) does not align with the data-generating process you describe. The count data from rnbinom(n = 500, mu = 5, size = 1) do not depend on time; mu needs to be a function of the time variable, or else the counts are independent of time.
Also, note that size = 1 does not remove overdispersion: for rnbinom() the variance is mu + mu^2/size, so size = 1 actually implies strong overdispersion (the Poisson, which is the limiting special case of the negative binomial, is only recovered as size goes to infinity). Given your description of the DGP, overdispersion in the empirical data does sound plausible, so you will probably want to estimate the dispersion parameter rather than fixing it arbitrarily.
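For intuition, a quick sketch of how size controls the spread (mu = 5 is just an illustrative value):
## variance of a negative binomial is mu + mu^2/size
mu <- 5
var(rpois(1e5, lambda = mu))            # ~ 5    (Poisson benchmark: variance = mean)
var(rnbinom(1e5, mu = mu, size = 1))    # ~ 30   (5 + 25: heavy overdispersion)
var(rnbinom(1e5, mu = mu, size = 100))  # ~ 5.25 (approaches Poisson as size grows)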
To answer your first question, see the code example below. Regarding your second question: no, I don't think that would be a good idea.
library(tidyverse)
library(rstanarm)
options(mc.cores = parallel::detectCores())
n <- 1000
empirical <-
  tibble(
    time = runif(n, 2, 10),
    count = rnbinom(n = n, mu = time, size = 1) # Generate count data that actually depends on time
  ) |>
  mutate(rate = count/time)

m_stan <- stan_glm.nb(count ~ time, data = empirical)

simulated <-
  tibble(
    time = runif(n, 2, 10)
  ) %>%
  mutate(
    count = posterior_predict(m_stan, ., draws = 1) |>
      as.vector(),
    rate = count/time
  )

d <- lst(simulated, empirical) |>
  bind_rows(.id = "data")

d |>
  select(data, count, rate) |>
  pivot_longer(c(count, rate)) |>
  ggplot() +
  geom_histogram(aes(value), binwidth = .2) +
  facet_grid(data ~ name, scales = "free")
Created on 2022-02-03 by the reprex package (v2.0.1)
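If you prefer to stay with the MASS::glm.nb fit from your question rather than rstanarm, here is a minimal sketch of the same idea using your offset model m2 (object names like new.time and d.sim.3 are just illustrative): simulate a new effort value for every observation, let the expected count scale with that effort, and recompute the rate afterwards.
new.time <- runif(500, 2, 10)                  ## simulate new survey efforts
mu.hat   <- exp(coef(m2)[1]) * new.time        ## expected count = (rate per minute) * effort
d.sim.3  <- data.frame(time = new.time,
                       new.counts = rnbinom(500, mu = mu.hat, size = m2$theta)) %>%  ## estimated dispersion
  mutate(new.rate = new.counts / time)         ## non-integer rates, comparable to figure 'B'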
I have 32 months of data, and I'm trying out different models for forecasting unit transitions to dead state "X" for months 13-32, training on the transitions data for months 1-12. I then compare the forecasts with the actual data for months 13-32. The data represent the migration of units from a beginning population into the dead state over 32 months. Not all of the beginning units die off, only a portion. I understand that 12 months of data for model training isn't much and that forecasting 20 months ahead from those 12 months should result in a wide distribution of outcomes; those are the real-world limitations I usually grapple with.
I am using the fable package ETS model and would like to know how, or whether, it is possible to set bounds for outputs when running simulations based on ETS. When I go to https://fable.tidyverts.org/reference/ETS.html to research setting bounds, the documentation lists a bounds argument (perhaps I misunderstand what is meant by "bounds"), but those instructions don't say how to actually specify the lower and upper bounds.
When I run ETS on my data and plot the forecast, the forecast mean (in blue) at least visually hews reasonably to the actual data for those same months 13-32 (in black). (I have run other tests of residuals and autocorrelations, as well as the benchmark methods recommended in the book, and this Holt's linear method looks fine based on those tests.)
However, when I run simulations using that ETS model (code is presented at the bottom, with the simulation function flagged by #), I often get aggregate forecasted transitions into dead state X over the forecast horizon (months 13-32) in excess of the beginning number of elements, which totals 60,000. In other words, there is no real-world scenario where transitions to the dead state can exceed the beginning population! Is there a way to set an upper bound on the forecast distribution and the simulations so the total forecast doesn't exceed the cap of 60,000 possible transitions, while maintaining statistical integrity and without injecting too much judgment?
I use a log transformation of the data to prevent the forecast from going negative; negative transitions aren't a real-world possibility for these data.
Below is the code for generating the above, and running simulations, including the dataset:
library(dplyr)
library(fabletools)
library(fable)
library(feasts)
library(ggplot2)
library(tidyr)
library(tsibble)
# my data
data <- data.frame(
  Month = 1:32,
  StateX = c(
    9416, 6086, 4559, 3586, 2887, 2175, 1945, 1675, 1418, 1259, 1079, 940, 923, 776, 638, 545, 547, 510, 379,
    341, 262, 241, 168, 155, 133, 76, 69, 45, 17, 9, 5, 0
  )
) %>%
  as_tsibble(index = Month)
# fit the model to my data, generate forecast for months 13-32, and plot
fit <- data[1:12,] |> model(ETS(log(StateX) ~ error("A") + trend("A") + season("N")))
fc <- fit |> forecast(h = 20)
fc |>
  autoplot(data) +
  geom_line(aes(y = .fitted), col = "#D55E00",
            data = augment(fit)) +
  labs(y = "Unit transitions", title = "Holt's linear method for transitions to dead state X") +
  guides(colour = "none")
# run simulations and show aggregate nbr of transitions for months 13-32
sim <- fit %>% generate(h = 20, times = 5000, bootstrap = TRUE)
agg_sim <- sim %>% group_by(.rep) %>% summarise(sum_FC = sum(.sim),.groups = 'drop')
max(agg_sim[,"sum_FC"])
I ran the scaled-logit transformation, which offers a good solution; see https://otexts.com/fpp3/limits.html. However, following Mitchell's 2nd comment ("...likely here is that your model is inappropriate for the structure in the data"), I explored this point further and am getting better results with a log-transformed ETS(M,A,N) model as opposed to the log-transformed ETS(A,A,N) model presented in the OP. The results are more reasonable with the log-transformed ETS(M,A,N) than with the scaled-logit-transformed ETS(M,A,N). Using log ETS(M,A,N) I get the below point and interval forecast:
Researching ETS(M,A,N) further, there is a view that the conditional distribution from these "class 2" models is not Gaussian, so there are no closed-form formulae for their prediction intervals; in some cases the Normal distribution can be used as an approximation, but simulations should generally be preferred. Running simulations for these data therefore results in the histogram below. I take comfort from it because the simulated forecast mean is a bit higher than the actuals, which is a desired outcome for the end purposes of this simulation, and the extreme outliers and roughly lognormal shape are to be expected for this type of data, which can be summarized as: when things go wrong in this process, they can go EXTREMELY BAD (outcomes can be really bad but never really good, sadly).
Below is code for the above (using packages from OP):
testDF <- data.frame(
  Month = 1:32,
  StateX = c(
    9416, 6086, 4559, 3586, 2887, 2175, 1945, 1675, 1418, 1259, 1079, 940, 923, 776, 638, 545, 547, 510, 379,
    341, 262, 241, 168, 155, 133, 76, 69, 45, 17, 9, 5, 0
  )
) %>% as_tsibble(index = Month)
fit <- testDF[1:12,] |> model(ETS(log(StateX) ~ trend("A")))
fit |>
  forecast(h = 20) |>
  autoplot(testDF) +
  labs(title = "Transition forecast versus actual data",
       y = "Unit transitions",
       x = "Months 13-32 are forecast periods")

sim <- fit |> generate(h = 20, times = 5000, bootstrap = TRUE)

agg_sim <- sim |>
  as.data.frame() |>
  group_by(.rep) |>
  summarise(sumFC = sum(.sim), .groups = 'drop')
agg_sim %>%
  ggplot(aes(x = sumFC)) +
  geom_histogram(bins = 50) +
  geom_vline(
    aes(xintercept = mean(as.data.frame(agg_sim)[,"sumFC"]),
        color = "forecast period simulation mean"),
    linetype = "solid",
    size = 1.5) +
  geom_vline(
    aes(xintercept = sum(as.data.frame(testDF[13:32,"StateX"])),
        color = "forecast period actual data"),
    linetype = "solid",
    size = 1.5) +
  scale_color_manual(
    name = "Vertical lines:",
    values = c(
      'forecast period actual data' = "blue",
      'forecast period simulation mean' = "red"
    )
  ) +
  labs(title = "Histogram of simulations during forecast months 13-32",
       x = 'Cumulative transitions during forecast period',
       y = 'Bin counts') +
  theme(legend.position = c(0.75, 0.85)) +
  xlim(0, 20000)
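For completeness, the scaled-logit transformation mentioned above (from the fpp3 link) can be registered with fabletools::new_transformation(); below is a minimal sketch, assuming bounds of 0 and 60,000 for each month's transitions (my_scaled_logit and fit_bounded are illustrative names):
# scaled logit and its inverse keep forecasts between 'lower' and 'upper'
scaled_logit <- function(x, lower = 0, upper = 1) {
  log((x - lower) / (upper - x))
}
inv_scaled_logit <- function(x, lower = 0, upper = 1) {
  (upper - lower) * exp(x) / (1 + exp(x)) + lower
}
my_scaled_logit <- fabletools::new_transformation(scaled_logit, inv_scaled_logit)
# bounded counterpart of the ETS fit above
fit_bounded <- testDF[1:12, ] |>
  model(ETS(my_scaled_logit(StateX, lower = 0, upper = 60000) ~ trend("A")))
Note that this bounds each monthly forecast; it does not by itself cap the aggregate over the 20-month horizon.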
The following code is a made-up example replicating my problem, which comes from a bigger, more complex data set.
library(marginaleffects)
library(truncnorm)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
summary(i)
I have used the predictions() function from the marginaleffects package to see the predicted yields (yield_kgha) for a range of nitrogen rates (n_kgha) from my regression model (i). My original data only have n_kgha rates ranging from approximately 40-250, so the code below lets me see predicted yields at n_kgha rates that are not actually in my data set.
p <- predictions(
  i,
  newdata = datagrid(model = i, n_kgha = seq(0, 300, by = 25), grid_type = "counterfactual")
)
summary(p, by = "n_kgha")
I would like to plot the response of yield conditioned on n_kgha ranging from 0-300, which includes n_kgha values not in my original data set. I have tried to do this using the plot_cap function.
plot_cap(i, condition = "n_kgha")
However, since my original data only have n_kgha rates ranging from 40-250, I am not getting my desired result of seeing the response curve over the full n_kgha (0:300) range. When I plot using the plot_cap function, I get a response curve with n_kgha ranging only from 40-250 (the min and max of the original data set).
Is there a way to run the plot_cap function based on the counterfactual range of n_kgha as used in the predictions function? Or should I use another method to plot the predicted values based on counterfactual values?
The plot_cap() function only plots over the observed range of values. However, since predictions() returns a “tidy” data frame, it is trivial to use the output of this function with ggplot2 to get the plot you need.
Note that we do not need to specify grid_type = "counterfactual". This option does something very weird and specific: it duplicates the whole dataset many times, once for each of the user-supplied values. It is only useful in very specific corner cases, and not when you just want to make predictions over unobserved values of the predictors. See the documentation with ?datagrid.
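To see the difference concretely, here is a quick sketch using the model i from the question (the row counts assume the 100-observation example data):
# counterfactual grid: the full 100-row data set is duplicated for each of the 13 supplied values
nrow(datagrid(model = i, n_kgha = seq(0, 300, by = 25), grid_type = "counterfactual"))  # 1300
# default grid: one row per supplied value, other model variables held at typical values
nrow(datagrid(model = i, n_kgha = seq(0, 300, by = 25)))                                # 13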
Here’s a simple example of predictions() with ggplot2 to achieve what you want:
library(marginaleffects)
library(truncnorm)
library(ggplot2)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
p <- predictions(i, newdata = datagrid(n_kgha = seq(0, 300, 10)))
ggplot(p, aes(n_kgha, predicted)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
geom_line() +
theme_minimal()
Fish age-at-maturity is the age at which the growth rate (the slope of the length-at-age curve) changes. Here is an example of a simulated individual fish and its two growth rates.
I want to create an algorithm that will determine age-at-maturity from age and length data similar to the picture I attached. I have very little idea of what kind of algorithm would be useful and how to apply it to my sample data set:
> head(data)
age length
1 0 0.01479779
2 1 0.05439856
3 2 0.18308919
4 3 0.24380771
5 4 0.37759992
6 5 0.44871502
It was suggested to me to try and use the Cox Proportional Hazards model. To do that I would consider age-at-maturity as a time to event (maturity is the event and age is the time when maturity is reached). I tried fitting that model but got this error:
> cox.model <- coxph(Surv(age ~ length), data = data)
Error in Surv(age ~ length) : Time variable is not numeric
I tried making both variables numeric using as.numeric(), but that did not help.
Please let me know if I am using this model wrong or if I should be using a different model.
As far as I know, time-to-event data should include an event indicator, i.e. a binary variable. If maturity is the event, then it should have been included in the dataset as such a binary variable, and you should run this:
cox.model <- coxph(Surv(age, maturity) ~ length, data = data)
Please check the manuals for more details: the survival package and the Cox model (coxph()).
BTW, the figure looks like it was created with something like segmented regression and ggplot; you may want to use a similar technique. Here is an example.
I agree with #C.C.: 1) a survival model is not applicable to the provided dataset, and 2) a simple piecewise (segmented) linear regression would be more appropriate.
Please see below the proposed R code for it, together with a sample output graph:
library(segmented)
# create dummy data set, extended from provided one, with noise
df <- data.frame(
age = seq(from = 0, to = 20, by = 1),
length = c(
seq(from = 0, to = 0.45, length.out = 5) + rnorm(5, mean = 1e-3, sd = 1e-2),
seq(from = 0.48, to = 0.6, length.out = 16) + rnorm(16, mean = 1e-3, sd = 1e-2)
)
)
# fit normal linear regression and segmented regression
lm1 <- lm(length ~ age, data = df)
seg_lm <- segmented(lm1, ~ age)
# determine age break point
age_break_point <- seg_lm$psi.history$all.selected.psi[[1]]
# plot raw data points, segmented fit and age break point
plot(seg_lm, res = TRUE, main=paste0('Growth rate change # ', round(age_break_point, 1), ' years old'), xlab='Age', ylab='Length')
abline(v = age_break_point, col='red')
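As a follow-up, the estimated breakpoint (the age-at-maturity estimate) and the two growth rates can also be read directly from the fitted segmented object; a short sketch using seg_lm from above:
summary(seg_lm)   # segment coefficients and the estimated breakpoint (psi)
confint(seg_lm)   # approximate confidence interval for the breakpoint
slope(seg_lm)     # the two growth rates, before and after the breakpoint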
I have a fictional dataframe like so, including continuous and categorical variables
library(dplyr)
library(ggplot2)
library(tidyr)
df <- tibble(
# each sample gets id from 1:1000
id = 1:1000,
# sex,categorical, either "f" or "m"
sex = ifelse(runif(1000, 0, 1) < 0.5, "f","m"),
# disease stage, categorical, either 1 or 2
stage = ifelse(runif(1000,0,1) < 0.5, 1,2),
# age, continuous
age = runif(1000,20,80),
# blood, continuous
blood = runif(1000,10,1000)
)
The categorical variables have a nearly 50:50 distribution
prop.table(table(df$sex))
prop.table(table(df$stage))
And the continuous variables have rather arbitrary, non-normal distributions
df %>%
gather(test, result, 4:5) %>%
ggplot(aes(result)) +
geom_density() +
facet_wrap(test ~ ., scale="free")
If I now draw a sample of 100 rows from df, the resulting distributions are entirely different from the initial distributions
sample_df <- sample_n(df, 100, replace=F)
sample_df %>%
gather(test, result, 4:5) %>%
ggplot(aes(result)) +
geom_density() +
facet_wrap(test ~ ., scale="free")
My question is now: how would I sample from df so that my sample_df follows the distribution and probability of all of my parameters (sex, age, stage, blood)? I thought about fitting a regression model to df and picking samples based on the residuals, i.e. the distance of each sample from the regression line.
The actual underlying problem is a large cohort of patient data from which I want to pick a subcohort while preserving the distribution and probability of certain patient and disease characteristics.
Any help is highly appreciated.
Edit
I know that a sample of 1/10 of the population is small and that increasing the sample size will make the sample distribution approximate that of the population it was drawn from. To make my situation more clear: it is not manageable for me to use more than, let's say, 1/4 of my population. And with every draw from the population there is some risk that I pick a very unrepresentative cohort (sampling error). So basically I'm looking for a method to minimize this risk and to maximize the chance that my sample is the most accurate representation of the population.
Your base population is sampled from a uniform distribution. Even with 1000 individuals, you can see from your figures that there is some "non-uniformness" to it. Your sample population is then just 100 individuals. By chance you will sample something that resembles, but does not perfectly reflect, your base population or a uniform distribution. The code below compares sample populations of 100 individuals and 20,000 individuals.
x1 <- runif(100000, 0, 1)
plot(NULL, xlim = c(0, 1), ylim = c(0, 1.2))
for (i in 1:20) {
  points(density(sample(x1, 100)), type = "l", col = "red")     # small samples: wiggly densities
  points(density(sample(x1, 20000)), type = "l", col = "black") # large samples: close to uniform
}
Okay, I think I figured out what I actually wanted: stratified sampling. Basically, define strata based on the frequency of certain parameters and sample from them.
Here's some further reading on that
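For example, here is a minimal sketch with dplyr, stratifying on the two categorical variables from df and sampling the same fraction from each stratum (continuous variables such as age could be binned with cut() first if they also need to be balanced):
library(dplyr)
## stratified sample: 10% from every sex x stage stratum
sample_df <- df %>%
  group_by(sex, stage) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()
## check that the strata proportions are preserved
prop.table(table(sample_df$sex))
prop.table(table(sample_df$stage))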
The response variable for my dataset consists of observations Y[1], Y[2], ..., Y[49]. I came up with a Bayesian hierarchical model to make Bayesian predictions for Y[50]. I also have MCMC samples for Y[1], ..., Y[49], which I can use to assess the overall fit of my Bayesian model by comparing them with the actual values of Y[1], Y[2], ..., Y[49].
Is there any way that I can draw caterpillar plots of my Bayesian predictions from the MCMC object of the hierarchical model, along with points that stand for the actual observed Y's from my original dataset, in R?
Thank you,
First you need to extract your credible intervals for each Y[i] (usually this is done with the quantile() function if you're not using a standard S3 object).
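For example, a minimal sketch, assuming the MCMC draws are stored in a matrix post with one column per Y[i] (so that q1, q2, and estimate line up with the data frame below):
estimate <- apply(post, 2, median)             # posterior medians
q1 <- apply(post, 2, quantile, probs = 0.025)  # lower 95% interval bounds
q2 <- apply(post, 2, quantile, probs = 0.975)  # upper 95% interval bounds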
Then you create the following df:
df <- tibble(
  obs = 1:49,
  lower = q1,
  upper = q2,
  estimate = estimate,
  actual = actual
)
Then you go:
df %>% ggplot(aes(x = obs)) +
geom_line(aes(y = actual)) +
geom_pointrange(aes(ymin = lower, ymax = upper, y = estimate)) +
coord_flip()
If you're doing hierarchical models, I really recommend the rstanarm package, which is compatible with the tidybayes library (and that combination produces caterpillar plots almost automatically).