Simulating an SIR model in R. I have a data set I am trying to plot accurately with the model. I am right now using the particle filter function, then would like to use the corresponding logLik method on the result. When I do this, I get "[1] -Inf" as a result. I can't find in the documentation why this is and how I can avoid it. Are my parameters for the model not accurate enough? Is there something else wrong?
My function looks like this:
SIRsim %>%
pfilter(Np=5000) -> pf
logLik(pf)
From an online course lesson entitled Likelihood for POMPS https://kingaa.github.io/sbied/pfilter/ , this is the R script for the lesson. However, the code works here... I'm not sure how to reproduce my specific problem with it and unfortunately cannot share the dataset or code I am using because it is for academic research.
library(tidyverse)
library(pomp)
options(stringsAsFactors=FALSE)
stopifnot(packageVersion("pomp")>="3.0")
set.seed(1350254336)
library(tidyverse)
library(pomp)
sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")
sir_init <- Csnippet("
S = nearbyint(eta*N);
I = 1;
R = nearbyint((1-eta)*N);
H = 0;
")
dmeas <- Csnippet("
lik = dbinom(reports,H,rho,give_log);
")
rmeas <- Csnippet("
reports = rbinom(H,rho);
")
read_csv("https://kingaa.github.io/sbied/pfilter/Measles_Consett_1948.csv")
%>%
select(week,reports=cases) %>%
filter(week<=42) %>%
pomp(
times="week",t0=0,
rprocess=euler(sir_step,delta.t=1/7),
rinit=sir_init,
rmeasure=rmeas,
dmeasure=dmeas,
accumvars="H",
statenames=c("S","I","R","H"),
paramnames=c("Beta","mu_IR","eta","rho","N"),
params=c(Beta=15,mu_IR=0.5,rho=0.5,eta=0.06,N=38000)
) -> measSIR
measSIR %>%
pfilter(Np=5000) -> pf
logLik(pf)
library(doParallel)
library(doRNG)
registerDoParallel()
registerDoRNG(652643293)
foreach (i=1:10, .combine=c) %dopar% {
measSIR %>% pfilter(Np=5000)
} -> pf
logLik(pf) -> ll
logmeanexp(ll,se=TRUE)
If I set Beta=100 in the code above I can get a negative-infinite log-likelihood.
Replacing the measurement-error snippet with this:
dmeas <- Csnippet("
double ll = dbinom(reports,H,rho,give_log);
lik = (!isfinite(ll) ? -1000 : ll );
")
appears to 'solve' the problem, although you should be a little bit careful; papering over numerical cracks like this is sometimes OK, but could conceivably come back to bite you in some way later on. If you just need to avoid non-finite values long enough to get into a reasonable parameter range this might be OK ...
Some guesses as to why this is happening:
you are somehow getting an "impossible" situation like a positive number of reported cases when the underlying true number of infections is zero.
Sometimes non-finite log-likelihoods occur when a very small positive probability underflows to zero. The equivalent here is likely that the probability of infection 1-exp(-Beta*I/N*dt) goes to 1.0; then any observed outcome where less than 100% of the population is infected is impossible.
You can try to diagnose the situation by seeing what the filtered trajectory actually looks like and comparing it with the data, or by adding debugging statements to the code. If there's a way to run just the deterministic simulation with your parameter values that might tell you pretty quickly what's going wrong.
An easier/more direct way to debug would be to replace the Csnippet you're using for dmeas with an R function: this will be slower but easier to work with (especially if you're not familiar with C coding). If you uncomment the browser() statement below, the code will drop into debug mode when you encounter the bad situation ...
dmeas <- function(reports,H,rho,log, ...) {
lik <- dbinom(reports,size=H,prob=rho,log=log)
if (!is.finite(lik)) {
lik <- -1000
## browser()
}
return(lik)
}
For example:
(t = 3, reports = 2, S = 2280, I = 0, R = 35721, H = 0, Beta = 100,
mu_IR = 0.5, rho = 0.5, eta = 0.06, N = 38000, log = TRUE)
Browse[1]> debug at /tmp/SO65554258.R!ZlSILG#7: return(lik)
Browse[2]> reports
[1] 2
Browse[2]> H
[1] 0
Browse[2]> rho
[1] 0.5
This shows that the problem is indeed that you have a positive number of reported cases when there have been zero infections ... R is trying to compute the binomial probability of observing reports cases out when there are H infections that are potentially reportable, each reported with a probability rho. When the number of trials N in a binomial probability Binom(N,p) is zero, the only possible outcome is zero 'successes' (reported cases), with probability 1. All other outcomes have probability 0 (and log-probability -Inf).
Related
I have wracked my brain trying to come up with a solution to this problem and I'm at wits end! First, the necessary context: Aquatic plants in lakes are sampled with rakes. You throw a rake out into the lake, you pull it back into your boat, and you figure out what plants are on its tines. In our case, we measure both presence/absence as well as "abundance," but in an ordinal/interval-censored way --> it's 0 if species X isn't noticed on the rake at all, 1 if it covers < 25% of the rake's tines, 2 if it covers between 25 and 75%, and 3 if it covers > 75%. However, it's fairly easy to miss a species entirely when it's in low abundance, so 0s are sketchy--they may not represent true absences, and that is really the issue our model is trying to explore.
So, there are really three layers here--a true, fully latent abundance that we don't observe directly at all, a partially latent "true presence/absence" in that we know where true presences are but not where true absences are, and then we have our observed presence/absence data. What's more interesting is that we think some environmental variables may affect both true abundance and true occurrence but differently, and then other variables may affect detectability, and it's those processes we're trying to tease apart.
So, anyhow, my actual model is much larger and more complicated than what I've pasted below, but here is a sort of functional (but probably academically meritless) training version of it that replicates the error I am getting.
#data setup
N = 1500 #Number of cases
obs = sample(c(0,1,2,3), N,
replace=T, prob=c(0.7, 0.2, 0.075, 0.025)) #Our observed, interval-censored data.
X1 = rnorm(N) #Some covariate that probably affects both occurrance and abundance but maybe in different ways.
abundances = rep(NA, times = N) #Abundance is a latent variable we don't directly observe. From elsewhere, I know the values here need to be NAs so the model will know to impute them
occur = rep(1, times = N) #Occurance is a degraded form of our abundance data.
#d will be the initials for the abundance data, since this is apparently needed to jumpstart the imputation.
d = vector()
for(o in 1:N) {
if (obs[o]==0) { d[o] = 0.025; occur[o] = 0 }
if (obs[o]==1) { d[o] = 0.15 }
if (obs[o]==2) { d[o] = 0.5 }
if (obs[o]==3) { d[o] = 0.875 }
}
#Data
test.data = list("N" = N,
"obs" = obs,
"X1" = X1,
"abund" = abundances,
"lim" = c(0.05, 0.25, 0.75, 0.9999),
"occur" = occur)
#Inits
inits = list(abund = d)
cat("model
{
for (i in 1:N) {
obs[i] ~ dinterval(abund[i], lim)
abund[i] ~ dbeta(theta[i], rho[i]) T(0.0001, 0.9999)
theta[i] <- mu[i] * epsilon
rho[i] <- epsilon * (1-mu[i])
logit(mu[i]) <- alpha1 + X.beta1 * X1[i]
occur[i] ~ dbern(phi[i])
logit(phi[i]) <- alpha2 + X.beta2 * X1[i]
}
#Priors
epsilon ~ dnorm(5, 0.1) T(0.01, 10)
alpha1 ~ dnorm(0, 0.01)
X.beta1 ~ dnorm(0, 0.01)
alpha2 ~ dnorm(0, 0.01)
X.beta2 ~ dnorm(0, 0.01)
}
", file = "training.txt")
test.run = jags.model(file = "training.txt", inits = inits, data=test.data, n.chains = 3)
params = c("epsilon",
"alpha1",
"alpha2",
"X.beta1",
"X.beta2")
run1 = run.jags("training.txt", data = test.data, n.chains=3, burnin = 1000, sample = 5000, adapt = 4000, thin = 2,
monitor = c(params), method="parallel", modules = 'glm')
At the end, I get this error, and I always get this error any time I try to do something even remotely like this:
Graph information: Observed stochastic nodes: 3000 Unobserved
stochastic nodes: 1505 Total graph size: 19519 . Reading
parameter file inits1.txt. Initializing model Error in node obs1
Node inconsistent with parents
I've read every posting that covers this error I can find, including this one, this one, this one, and this one. I can surmise from my research and testing that the error is probably occurring for one of the following reasons.
My initials for the latent abundance variable are not adequate somehow. It sounds like this requires pretty useful initial values to work.
One or more of my priors is allowing values that are not permissible OR they are too broad and that's causing problems somehow. This might be especially an issue because of the beta distribution I am using which has strong requirements about not having values outside of 0 and 1.
I am using the dinterval() function incorrectly, which seems likely because it is always the line containing it that trips the error.
My model is somehow mis-specified.
But I can't see where I might be going wrong--I have tried a number of different options for 1 and 2, and so far as I can tell from the documentation (see pages 55-56), I am using dinterval correctly. What am I missing??
In case it's relevant, from what I have gathered, the idea of dinterval() is that the variable on the left of the ~ is the interval-censored version of the variable given in the first argument (here, abundance). Then, the second argument (here, lim) is a vector of "breakpoints" that dictate which intervals the abundance data end up in. So, here, you end up with an observed abundance code of 0 if you are lower than the lowest lim (here, 0.05), 1 if you are in between the first two values in lim, etc. It's like the abundance variable is being pushed through a "binning sieve" created by the lim variable to produce a binned output variable, our observed abundances.
Any guidance would be most welcome!!
I have run your example with JAGS 4.3.0 and rjags 4-12. For me, the version with rjags runs correctly. The version with runjags does not work because you have not provided intial values. This is easily fixed by adding the argument
inits=list(inits, inits, inits)
to the call to run.jags().
You have correctly understood the purpose of dinterval. This is an "observable function" which imposes constraints on its parameters via a likelihood. When using dinterval you must always provide initial values that satisfy the constraints from the fist iteration. As far as I can see, your initial values do satisfy the constraints and this is verified by the fact that I can run your example (with initial values).
I asked a similar question on CrossValidated, but did not get a response. I went ahead anyway, and built out a function but am having a problem with replication...
The original question, posted here is as such:
I am seeking a function (or short algorithm, ideally implemented in R) that produces something similar to the following:
See, I would like to be able to generate a vector of n items that follows this sort of pattern, mapped to a set of inputs (say, seq(1:n)). Ideally, I would be able to tell the algorithm to "spike" to a maximum height h on every kth time period, and decay at rate r. However, I would be sufficiently happy with simply being able to generate a spike pattern that occurs periodically.
I wrote some code in R, which is included here, that works fairly well...
## Neural Networks / Deep Learning ##
# first, must install Python from:
# https://www.anaconda.com/download/#windows
# https://www.python.org/downloads/
if (!require(keras)) devtools::install_github("rstudio/keras") ; library(keras)
# install_tensorflow()
spikes_model <- function(maxiter, total_spikes = 10, max_height = 0.001, min_height = 0.000005, decay_rate = 1) {
value_at_iteration <- rep(0, maxiter)
spike_at <- maxiter / total_spikes
current_rate <- min_height
holder_timeval <- 0
for(i in 1:maxiter) {
spike_indicator <- i / spike_at
if (is.integer(spike_indicator)) {
current_rate <- max_height
value_at_iteration[i] <- current_rate
holder_timeval <- spike_indicator
} else if (i < spike_at) {
current_rate <- min_height
value_at_iteration[i] <- current_rate
} else {
timeval <- i - (holder_timeval*spike_at)
current_rate <- max_height*exp(-decay_rate*timeval) + min_height
value_at_iteration[i] <- current_rate
}
}
return(value_at_iteration)
}
asdf <- spikes_model(maxiter = 100)
plot(asdf, type="l")
... which results in the following plot:
This is exactly what I want, except there is only one spike. I know there is a code or logic error somewhere, but I can not find where I am going wrong. Please help me replicate this spike procedure across time.
The code this scheduler is used in:
eps <- 1000
sch <- spikes_model(eps)
lr_schedule <- function(epoch, lr) {
lrn <- sch[as.integer(epoch)]
lrn <- k_cast_to_floatx(lrn)
return(lrn)
}
## Add callback to automatically adjust learning rate downward when training reaches plateau ##
reduce_lr <- callback_learning_rate_scheduler(lr_schedule)
## Fit model using trainig data, validate with validation data ##
mod1.hst <- mod1 %>% fit(
x=X.train, y=Y.train,
epochs=eps, batch_size=nrow(X.train),
validation_data = list(X.val, Y.val),
shuffle=TRUE, callbacks = list(checkpoint, reduce_lr)
)
Wow, I just figured out my own error. I was using the is.integer() function, which does not work how I wanted. I needed to use the is.whole.number() function from mosaic.
Fixing that single error, I find the following chart, which is exactly what I wanted.
i am trying to build a recursive function in R,
H(x,t) = \sum\limits_{d=0}^{x} (Pr(D=d)*(h*(x-d)+H(x-d,t-1)))
+ \sum\limits_{d=x+1}^{\infty} (Pr(D=d)*(p(*d-x)+ H(0,t-1)))
Where h,p are some constants, D ~ Po(l) and H(x,0) = 0, the are code i have done so far, gives an obvious error, but i can't see the fix. The code
p<- 1000 # Unit penalty cost for lost sales
h<- 10 # Unit inventory holding cost pr. time unit
l<- 5 # Mean of D
H <- function(x,t){
if(t==0)(return(0))
fp <- 0
sp <- 0
for(d in 0:x){
fp <- fp + dpois(x=d,l)*(h*(x-d)+H(x-d,t-1))
}
for(d in x+1:Inf){
sp <- sp + dpois(x=d,l)*(p*(d-x)+H(0,t-1))
}
return(fp+sp)
}
When i run this, the error is
Error in 1:Inf : result would be too long a vector
Which, seems obvious, so the question is, can anyone point me in the direction to redefine the problem, so i can get R to bring me a solution?
Thanks in advance.
Going from x+1:Inf won't work. Since you're using poisson's pdf, you can just add a upper bound (why? think about the shape of the pdf and how small the values are at the right tail):
for(d in x+1:100)
which when ran for H(20,2) gives
[1] 252.806
when you increase it to
for(d in x+1:500)
then H(20,2) also gives
[1] 252.806
I'm using frbs package in R on my data set using 5-fold stratified cross validation. I've implemented stratified CV. I use GFS.GCCL method for frbs.learn function in each fold and predict the result using test data. I get this error as well as 30 equal warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is written in below:
library(frbs)
data<-read.csv(file.address)
data[,30] <- unclass(data[,30]) #column 30 has the class of samples
data <- data[,c(1,14,20,26,27, 30)] # I choose to have 5 attr. since
#my data is high dimensional
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) #stratification function for CV
range.data.inp <- matrix(apply(data[,-ncol(data)], 2, range), nrow=2)
data<-norm.data(as.matrix(data[,-ncol(data)]),range.data.
inp,min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels= 3,
persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
name="sim-1")
for(i in 1:k){
str <- paste("fold",i)
print(str)
test.ind <- folds[[str]]
test.data <- data[test.ind,]
train.data <- data[-test.ind,]
obj <- frbs.learn(train.data , method.type="GFS.GCCL",
range.data.inp , ctrl)
pred <- predict(obj, test.data)
print("Predicted classes:")
print(pred)
}
I don't have any idea about error and warnings. Please let me know what I should do.
I've had similar problem (and others) trying to reproduce the SLAVE learning starting with the iris example data. I had 2 format items to solve before being able to run this with my artifical data:
my dataframe import was giving me integer, where the learn needs at least numeric.
my distribution of criteria was not flat. When I flattened the distribution (3 values so n/3 samples per value) everything went fine.
That's all I know.
Hope it helps.
I encountered the same issue when I was running SLAVE and GFS.GCCL. When I was looking at the source code of the library. I found that in frbs.learn(), each method has an implementation to calculate the range of input data. So, I think it might be a problem with the range of input data. For example, in GFS.GCCL, in the source code, for setting the parameters, it looks like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
in the first line, range.data is either coming from your specification nor the default setting of frbs.learn(). For the default setting, it gets the max and min for each row. In the source code:
range.data <- rbind(dt.min, dt.max)
After that, the range of data taken by the GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. The GFS.GCCL is also taken the range.data.input as parameter. So, it takes both range.data.norm and range.data.input.
Therefore, I think if internally, there are some calculation corresponding to range.data.input (it needs to be set as min, max for each row), but the setting for this is actually not min and max for each row. The error is generated.
But, in summary, after I remove "range.data"from frbs.learn(), both GFS.GCCL and SLAVE work for me.
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R
In addition to #Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package.
Most important: Make sure your output variable is never equal to 0. It looks like you have a binary output variable so I am hoping just adding 1 to it so it is 1/2 instead of 0/1 will work.
Try setting your range.data.inp matrix to be all 0's in the first row and all 1's in the second. Naturally it's better to have a tighter range but it may be causing your bug.
Try decreasing the number of labels to 2.
It's can be a brittle procedure.
glm.nb throws an unusual error on certain inputs. While there are a variety of values that cause this error, changing the input even very slightly can prevent the error.
A reproducible example:
set.seed(11)
pop <- rnbinom(n=1000,size=1,mu=0.05)
glm.nb(pop~1,maxit=1000)
Running this code throws the error:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
At first I assumed that this had something to do with the algorithm not converging. However, I was surprised to find that changing the input even very slightly can prevent the error. For example:
pop[1000] <- pop[1000] + 1
glm.nb(pop~1,maxit=1000)
I've found that it throws this error on 19.4% of the seeds between 1 and 500:
fit.with.seed = function(s) {
set.seed(s)
pop <- rnbinom(n=1000, size=1, mu=0.05)
m = glm.nb(pop~1, maxit=1000)
}
errors = sapply(1:500, function(s) {
is.null(tryCatch(fit.with.seed(s), error=function(e) NULL))
})
mean(errors)
I've found only one mention of this error anywhere, on a thread with no responses.
What could be causing this error, and how can it be fixed (other than randomly permuting the inputs every time glm.nb throws an error?)
ETA: Setting control=glm.control(maxit=200,trace = 3) finds that the theta.ml algorithm breaks by getting very large, then becoming -Inf, then becoming NaN:
theta.ml: iter67 theta =5.77203e+15
theta.ml: iter68 theta =5.28327e+15
theta.ml: iter69 theta =1.41103e+16
theta.ml: iter70 theta =-Inf
theta.ml: iter71 theta =NaN
It's a bit crude, but in the past I have been able to work around problems with glm.nb by resorting to straight maximum likelihood estimation (i.e. no clever iterative estimation algorithms as used in glm.nb)
Some poking around/profiling indicates that the MLE for the theta parameter is effectively infinite. I decided to fit it on the inverse scale, so that I could put a boundary at 0 (a fancier version would set up a log-likelihood function that would revert to Poisson at theta=zero, but that would undo the point of trying to come up with a quick, canned solution).
With two of the bad examples given above, this works reasonably well, although it does warn that the parameter fit is on the boundary ...
library(bbmle)
m1 <- mle2(Y~dnbinom(mu=exp(logmu),size=1/invk),
data=d1,
parameters=list(logmu~X1+X2+offset(X3)),
start=list(logmu=0,invk=1),
method="L-BFGS-B",
lower=c(rep(-Inf,12),1e-8))
The second example is actually more interesting because it demonstrates numerically that the MLE for theta is essentially infinite even though we have a good-sized data set that is exactly generated from negative binomial deviates (or else I'm confused about something ...)
set.seed(11);pop <- rnbinom(n=1000,size=1,mu=0.05);glm.nb(pop~1,maxit=1000)
m2 <- mle2(pop~dnbinom(mu=exp(logmu),size=1/invk),
data=data.frame(pop),
start=list(logmu=0,invk=1),
method="L-BFGS-B",
lower=c(-Inf,1e-8))
Edit: The code and answer has been simplified to one sample, like in the question.
Yes, theta can approach Inf in small samples and sparse data (many zeroes, small mean and large skew). I have found that fitting glm.nb fails when the data are all zeroes and returns:
Error in while ((it <- it + 1) < limit && abs(del) > eps) { :
missing value where TRUE/FALSE needed
The following code simulates small samples with a small mean and theta. To prevent the loop from crashing, glm.nb is not fitted when the data are all zeroes.
en1 <- 10
mu1 <- 0.5
size1 <- 0.5
temp <- matrix(nrow=10000, ncol=2)
# theta == Inf is rare so use a large number of reps
for (r in 1:10000){
dat1 <- rnbinom(n=en1, size=size1, mu=mu1)
temp[r, 1:2] <- c(mean(dat1), ifelse(max(dat1)!=0, glm.nb(dat1~1)$theta, NA))
}
temp <- as.data.frame(temp)
names(temp) <- c("mean1","theta1")
temp[which(is.na(temp$theta1)),]
# note that it's rare to get all zeroes in the sample
sum(is.na(temp$theta1))/dim(temp)[1]
# a log scale helps see what's happening
with(temp, plot(mean1, log10(theta1)))
# estimated thetas should equal size1 = 0.5
abline(h=log10(0.5), col="red")
text(2.5, 5, "n1 = n2 = 10", col="red", cex=2, adj=1)
text(1, 4, "extreme thetas", col="red", cex=2)
See that estimated thetas can be extremely large when the sample size is small (in the first plot below):
Lesson learnt: don't expect high quality results from glm.nb for small samples and sparse data; get larger samples (e.g. in the second plot below).