I am new to WinBUGS/OpenBUGS and having difficulty debugging my code.
Does anyone know of a list of potential error messages for BUGS models and their meanings in plain English?
The WinBUGS manual has a list of some common error. I have added some additional notes from my own experience:
expected variable name indicates an inappropriate variable name. I occasionally get this error in providing the data, might have used 1.02e04 instead of 1.02E04.
undefined variable - variables in a data file must be defined in a model (just put them in as constants or with vague priors). If a logical node is reported undefined, the problem may be with a node on the 'right hand side'. I occasionally get this error when I have removed a variable from the model but not from the data or missed a comma in the data.
invalid or unexpected token scanned - check that the value field of a logical node in a Doodle has been completed.
index out of range - usually indicates that a loop-index goes beyond the size of a vector (or matrix dimension); sometimes, however, appears if the # has been omitted from the beginning of a comment line
linear predictor in probit regression too large indicates numerical overflow. See possible solutions below for Trap 'undefined real result'.
logical expression too complex - a logical node is defined in terms of too many parameters/constants or too many operators: try introducing further logical nodes to represent parts of the overall calculation; for example, a1 + a2 + a3 + b1 + b2 + b3 could be written as A + B where A and B are the simpler logical expressions a1 + a2 + a3 and b1 + b2 + b3, respectively. Note that linear predictors with many terms should be formulated by 'vectorizing' parameters and covariates and by then using the inprod(.,.) function
unable to choose update method indicates that a restriction in the program has been violated
You might also hit a trap at the start or during the MCMC. The BUGS manual list the following common traps (I always get the first two, never met the last two):
undefined real result indicates numerical overflow. Possible reasons include:
initial values generated from a 'vague' prior distribution may be numerically extreme - specify appropriate initial values;
numerically impossible values such as log of a non-positive number - check, for example, that no zero expectations have been given when Poisson modelling;
numerical difficulties in sampling. Possible solutions include:
better initial values;
more informative priors - uniform priors might still be used but with their range restricted to plausible values;
better parameterisation to improve orthogonality;
standardisation of covariates to have mean 0 and standard deviation 1.
can happen if all initial values are equal.Probit models are particularly susceptible to this problem, i.e. generating undefined real results. If a probit is a stochastic node, it may help to put reasonable bounds on its distribution, e.g.
probit(p[i]) <- delta[i]
delta[i] ~ dnorm(mu[i], tau)I(-5, 5)
This trap can sometimes be escaped from by simply clicking on the update button. The equivalent construction
p[i] <- phi(delta[i])
may be more forgiving.
index array out of range
possible reasons include:
attempting to assign values beyond the declared length of an array;
if a logical expression is too long to evaluate break it down into smaller components.
stack overflow can occur if there is a recursive definition of a logical node.
NIL dereference (read) can occur at compilation in some circumstances when an inappropriate transformation is made, for example an array into a scalar.
Trap messages referring to DFreeARS indicate numerical problems with the derivative-free adaptive rejection algorithm used for log-concave distributions. One possibility is to change to "Slice" sampling
This WinBUGS User Manual might be of some use.
Related
My goal is to understand how Eva shrinks the intervals for a variable. for example:
unsigned int nondet_uint(void);
int main()
{
unsigned int x=nondet_uint();
unsigned int y=nondet_uint();
//# assert x >= 20 && x <= 30;
//# assert y <= 60;
//# assert(x>=y);
return 0;
}
So, we have x=[20,30] and y=[0,60]. However, the results from Eva shrinks y to [0,30] which is where the domain may be valid.
[eva] ====== VALUES COMPUTED ======
[eva:final-states] Values at end of function main:
x ∈ [20..30]
y ∈ [0..30]
__retres ∈ {0}
I tried some options for the Eva plugin, but none showed the steps for it. May I ask you to provide the method or publication on how to compute these values?
Showing values during abstract interpretation
I tried some options for the Eva plugin, but none showed the steps for it.
The most efficient way to follow the evaluation is not via command-line options, but by adding Frama_C_show_each(exp) statements in the code. These are special function calls which, during the analysis, emit the values of the expression contained in them. They are especially useful in loops, for instance to see when a widening is triggered, what happens to the loop counter values.
Note that displaying all of the intermediary evaluation and reduction steps would be very verbose, even for very small programs. By default, this information is not exposed, since it is too dense and rarely useful.
For starters, try adding Frama_C_show_each statements, and use the Frama-C GUI to see the result. It allows focusing on any expression in the code and, in the Values tab, shows the values for the given expression, at the selected statement, for each callstack. You can also press Ctrl+E and type an arbitrary expression to have its value evaluated at that statement.
If you want more details about the values, their reductions, and the overall mechanism, see the section below.
Detailed information about values in Eva
Your question is related to the values used by the abstract interpretation engine in Eva.
Chapter 3 of the Eva User Manual describes the abstractions used by the engine, which are, succinctly:
sets of integers, which are maximally precise but limited to a number of elements (modified by option -eva-ilevel, which on Frama-C 22 is set to 8 by default);
integer intervals with periodicity information (also called modulo, or congruence), e.g. [2..42],2%10 being the set containing {2, 12, 22, 32, 42}. In the simple case, e.g. [2..42], all integers between 2 and 42 are included;
sets of addresses (for pointers), with offsets represented using the above values (sets of integers or intervals);
intervals of floating-point variables (unlike integers, there are no small sets of floating-point values).
Why is all of this necessary? Because without knowing some of these details, you'll have a hard time understanding why the analysis is sometimes precise, sometimes imprecise.
Note that the term reduction is used in the documentation, instead of shrinkage. So look for words related to reduce in the Eva manual when searching for clues.
For instance, in the following code:
int a = Frama_C_interval(-5, 5);
if (a != 0) {
//# assert a != 0;
int b = 5 / a;
}
By default, the analysis will not be able to remove the 0 from the interval inside the if, because [-5..-1];[1..5] is not an interval, but a disjoint union of intervals. However, if the number of elements drops below -eva-ilevel, then the analysis will convert it into a small set, and get a precise result. Therefore, changing some analysis options will result in different ranges, and different results.
In some cases, you can force Eva to compute using disjunctions, for instance by adding the split ACSL annotation, e.g. //# split a < b || a >= b;. But you still need the give the analysis some "fuel" for it to evaluate both branches separately. The easiest way to do so is to use -eva-precision N, with N being an integer between 0 and 11. The higher N is, the more splitting is allowed to happen, but the longer the analysis may take.
Note that, to ensure termination of the analysis, some mechanisms such as widening are used. Without it, a simple loop might require billions of evaluation steps to terminate. This mechanism may introduce extra values which lead to a less precise analysis.
Finally, there are also some abstract domains (option -eva-domains) which allow other kinds of values besides the default ones mentioned above. For instance, the sign domain allows splitting values between negative, zero and positive, and would avoid the imprecision in the above example. The Eva user manual contains examples of usage of each of the domains, indicating when they are useful.
Can somebody tell me why this author has used the following code in their normalisation.
The first line appears fine to me they have standardised the training set by the following formula;
(x - mean(x)) / std(x)
However the second line and third line (validation and test) they have used the train mean (trainme) and train standard deviation (trainstd). Should they not have used the validation mean (validationme) and validation standard deviation (validationstd) along with the test mean and test standard deviation?
You can also view the page from the book at the following link (page 173)
What the authors are doing is reasonable and it's what is conventionally done. The idea is that the same normalization is applied to all inputs. This is essentially allocating some new parameters (offset and scale) and estimating them from the training data. In that scheme, if the value 100 is input, then the normalized value is (100 - offset)/scale, no matter where (training, testing, whatever) that 100 came from.
I guess one can also make an argument that the offset and scale should be context dependent in the sense that if you are given a set of data and for some reason the offset and scale are very different from the original training data, maybe what's important is how big each value is relative to the others in the same data set. E.g. maybe you should treat 200 the same as 100, if the scale is twice as big in the data set containing 200.
Whether that data-dependent scaling is reasonable would have to be decided case by case. I don't remember ever having seen it, but it's plausible that it could be the right thing to do in some cases.
By the way, you'll get more interest in general statistical questions at stats.stackexchange.com and/or datascience.stackexchange.com.
As I was working out how Epi generates the basis for its spline functions (via the function Ns), I was a little confused by how it handles the detrend argument.
When detrend=T I would have expected that Epi::Ns(...) would more or less project the basis given by splines::ns(...) onto the orthogonal complement of the column space of [1 t] and finally extract the set of linearly independent columns (so that we have a basis).
However, this doesn’t appear to be the exactly the case; I tried
library(Epi)
x=seq(-0.75, 0.75, length.out=5)
Ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1), detrend=T)
and
library(splines)
detrend(ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1)), x)
The matrices produced by the above code are not the same, however, they do have the same column space (in this example) suggesting that if plugged in to a linear model, the fitted coefficients will be different but the fit (itself) will be the same.
The first question I had was; is this true in general?
The second question is why are the two different?
Regarding the second question - when detrend is specified, Epi::Ns gives a warning that fixsl is ignored.
Diving into Epi github NS.r ... in the construction of the basis, in the call to Epi::Ns above with detrend=T, the worker ns.ld() is called (a function almost identical to the guts of splines::ns()), which passes c(NA,NA) along to splines::spline.des as the derivs argument in determining a matrix const;
const <- splines::spline.des( Aknots, Boundary.knots, 4, c(2-fixsl[1],2-fixsl[2]))$design
This is the difference between what happens in Ns(detrend=T) and the call to ns() above which passes c(2,2) to splineDesign as the derivs argument.
So that explains how they are different, but not why? Does anyone have an explanation for why fixsl=c(NA,NA) is used instead of fixsl=c(F,F) in Epi::Ns()?
And does anyone have a proof/or an answer to the first question?
I think the orthogonal complement of const's column space is used so that second (or desired) derivatives are zero at the boundary (via projection of the general spline basis) - but I'm not sure about this step as I haven't dug into the mathematics, I'm just going by my 'feel' for it. Perhaps if I understood this better, the reason that the differences in the result for const from the call to splineDesign/spline.des (in ns() and Ns() respectively) would explain why the two matrices from the start are not the same, yet yield the same fit.
The fixsl=c(NA,NA) was a bug that has been fixed since a while. See the commits on the CRAN Github mirror.
I have still sent an email to the maintainer to ask if the fix could be made a little bit more consistent with the condition, but in principle this could be closed.
Had a tough time thinking of an appropriate title, but I'm just trying to code something that can auto compute the following simple math problem:
The average value of a,b,c is 25. The average value of b,c is 23. What is the value of 'a'?
For us humans we can easily compute that the value of 'a' is 29, without the need to know b and c. But I'm not sure if this is possible in programming, where we code a function that takes in the average values of 'a,b,c' and 'b,c' and outputs 'a' automatically.
Yes, it is possible to do this. The reason for this is that you can model the sort of problem being described here as a system of linear equations. For example, when you say that the average of a, b, and c is 25, then you're saying that
a / 3 + b / 3 + c / 3 = 25.
Adding in the constraint that the average of b and c is 23 gives the equation
b / 2 + c / 2 = 23.
More generally, any constraint of the form "the average of the variables x1, x2, ..., xn is M" can be written as
x1 / n + x2 / n + ... + xn / n = M.
Once you have all of these constraints written out, solving for the value of a particular variable - or determining that many solutions exists - reduces to solving a system of linear equations. There are a number of techniques to do this, with Gaussian elimination with backpropagation being a particularly common way to do this (though often you'd just hand this to MATLAB or a linear algebra package and have it do the work for you.)
There's no guarantee in general that given a collection of equations the computer can determine whether or not they have a solution or to deduce a value of a variable, but this happens to be one of the nice cases where the shape of the contraints make the problem amenable to exact solutions.
Alright I have figured some things out. To answer the question as per title directly, it's possible to represent average value in programming. 1 possible way is to create a list of map data structures which store the set collection as key (eg. "a,b,c"), while the average value of the set will be the value (eg. 25).
Extract the key and split its string by comma, store into list, then multiply the average value by the size of list to get the total (eg. 25x3 and 23x2). With this, no semantic information will be lost.
As for the context to which I asked this question, the more proper description to the problem is "Given a set of average values of different combinations of variables, is it possible to find the value of each variable?" The answer to this is open. I can't figure it out, but below is an attempt in describing the logic flow if one were to code it out:
Match the lists (from Paragraph 2) against one another in all possible combinations to check if a list contains all elements in another list. If so, substract the lists (eg. abc-bc) as well as the value (eg. 75-46). If upon substracting we only have 1 variable in the collection, then we have found the value for this variable.
If there's still more than 1 variables left such as abcd - bc = ad, then store the values as a map data structure and repeat the process, till the point where the substraction count in the full iteration is 0 for all possible combinations (eg. ac can't substract bc). This is unfortunately not where it ends.
Further solutions may be found by combining the lists (eg. ac + bd = abcd) to get more possible ways to subtract and derive at the answer. When this is the case, you just don't know when to stop trying, and the list of combinations will get exponential. Maybe someone with strong related mathematical theories may be able to prove that upon a certain number of iteration, further additions are useless and hence should stop. Heck, it may even be possible that negative values are also helpful, and hence contradict what I said earlier about 'ac' can't subtract 'bd' (to get a,c,-b,-d). This will give even more combinations to compute.
People with stronger computing science foundations may try what templatetypedef has suggested.
I'm doing some statistical analysis with R software (bootstrapped Kolmogorov-Smirnov tests) of very large data sets, meaning that my p values are all incredibly small. I've Bonferroni corrected for the large number of tests that I've performed meaning that my alpha value is also very small in order to reject the null hypothesis.
The problem is, R presents me with p values of 0 in some cases where the p value is presumably so small that it cannot be presented (these are usually for the very large sample sizes). While I can happily reject the null hypothesis for these tests, the data is for publication, so I'll need to write p < ..... but I don't know what the lowest reportable values in R are?
I'm using the ks.boot function in case that matters.
Any help would be much appreciated!
.Machine$double.xmin gives you the smallest non-zero normalized floating-point number. On most systems that's 2.225074e-308. However, I don't believe this is a sensible limit.
Instead I suggest that in Matching::ks.boot you change the line
ks.boot.pval <- bbcount/nboots to
ks.boot.pval <- log(bbcount)-log(nboots) and work on the log-scale.
Edit:
You can use trace to modify the function.
Step 1: Look at the function body, to find out where to add additional code.
as.list(body(ks.boot))
You'll see that element 17 is ks.boot.pval <- bbcount/nboots, so we need to add the modified code directly after that.
Step 2: trace the function.
trace (ks.boot, quote(ks.boot.pval <- log(bbcount)-log(nboots)), at=18)
Step 3: Now you can use ks.boot and it will return the logarithm of the bootstrap p-value as ks.boot.pvalue. Note that you cannot use summary.ks.boot since it calls format.pval, which will not show you negative values.
Step 4: Use untrace(ks.boot) to remove the modifications.
I don't know whether ks.boot has methods in the packages Rmpfr or gmp but if it does, or you feel like rolling your own code, you can work with arbitrary precision and arbitrary size numbers.