Linear Dependence errors in DiceKriging with linearly independent data

The Problem (with minimal working example)
DiceKriging gives linear dependence errors about half the time when the design data are close to being linearly dependent. The example below produced an error roughly half the time when I ran it on both an Ubuntu and a Windows computer, with either genetic or BFGS optimisation.
install.packages("DiceKriging")
library(DiceKriging)
x = data.frame(Param1 = c(0,1,2,2,2,2), Param2 = c(2,0,1,1e-7,0,2))
y = 1:6
duplicated(cbind(x,y))
Model = km( design = x, response = y , optim.method = "gen", control = list(maxit = 1e4), coef.cov = 1)
Model = km( design = x, response = y , optim.method = "BFGS", control = list(maxit = 1e4), coef.cov = 1)
When the data are a little more dispersed, no such errors occur.
# No problems occur if the data is more dispersed.
x = data.frame(Param1 = c(0,1,2,2,2,2), Param2 = c(2,0,1,1e-2,0,2))
y = 1:6
duplicated(cbind(x,y))
Model = km( design = x, response = y , optim.method = "gen", control = list(maxit = 1e4), coef.cov = 1)
Why this is a problem
Using Kriging to optimise expensive models means that points near the optima will be densely sampled, which is impossible if this error keeps occurring. In addition, when several parameters are close at once, the points do not need to be as close as the 1e-7 above to trigger the error: in my actual problem (not the MWE above) it occurred when the 4 coordinates of a point were each around 1e-3 away from another point's.
Related Questions
There are not many DiceKriging questions on Stack Overflow. The closest question is this one (about the Kriging package), in which the problem is genuine linear dependence. Note that the Kriging package is not a substitute for DiceKriging, as it is restricted to 2 dimensions.
Desired Solution
I would like either:
A way to change my km call to avoid this problem (preferred)
A way to determine when this problem will occur so that I can drop observations that are too close to each other for the kriging call.

Your problem is not a software problem. It's rather a mathematical one.
Your first data set contains the two points (2, 1e-7) and (2, 0), which are very close but correspond to (very) different outputs: 4 and 5. You are therefore trying to adjust a Kriging model, which is an interpolation model, to a chaotic response. This cannot work. Kriging/Gaussian process modelling is not the right tool if your response varies a lot between points which are close together.
However, when you are optimizing expensive models, things are usually not like in your example: the response does not differ this much between very close input points.
But there can indeed be numerical problems if your points are very close.
In order to soften these numerical errors, you can add a nugget effect. The nugget is a small constant variance added to the diagonal of the covariance matrix, which means the points are no longer exactly interpolated: the kriging approximation is not forced to pass exactly through the learning points. The kriging model then becomes a regression model.
In DiceKriging, the nugget can be added in two ways. First, you can choose a value a priori and add it "manually" by setting km(..., nugget=your_value), or you can ask km to learn it at the same time it learns the parameters of the covariance function, by setting km(..., nugget.estim=TRUE). I advise the second option in general.
Your small example then becomes:
Model = km(design = x, response = y, optim.method = "BFGS", control = list(maxit = 1e4),
           coef.cov = 1, nugget.estim = TRUE)
Extract from the DiceKriging documentation:
nugget: an optional variance value standing for the homogeneous nugget
effect.
nugget.estim: an optional boolean indicating whether the nugget
effect should be estimated. Note that this option does not concern the
case of heterogeneous noisy observations (see noise.var below). If
nugget is given, it is used as an initial value. Default is FALSE.
PS: The choice of the covariance kernel can be important in some applications. When the function to approximate is quite rough, the exponential kernel (set covtype="exp" in km) can be a good choice. See the book by Rasmussen and Williams for more information (freely and legally available at http://www.gaussianprocess.org/).
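For illustration, here is a minimal sketch (my own, reusing the toy data above; the particular settings are just examples) of combining an estimated nugget with the exponential kernel:
Model = km(design = x, response = y,
           covtype = "exp",             # rougher sample paths than the default "matern5_2"
           nugget.estim = TRUE,         # estimate the nugget together with the covariance parameters
           optim.method = "BFGS",
           control = list(maxit = 1e4))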

Related

Would nonidentifiability create an inconsistent response from optim in R?

My objective is to use a kinetic model to describe reaction data. The application is a fuel, and the model is widely accepted as one of the more accurate ones for the setup of my problem. I may have a nonidentifiability issue, but it bothers me that optim gives such an inconsistent response.
Take the two graphs. In the first, I have picked a point based on its low squared error; the second is what optim selected (I don't have enough rep to post picture 2, but it's not close to lining up). When I ran the numbers optim gave me, they did not match the expected response. I wish I could paste the exact values, but the optimization takes more than two hours per run, so I have been tuning it as much as I can with the time I can get. I can say that R is settling on the boundaries. The bounds are set to physical limits one can obtain from the pure compound at room temperature (i.e. the molarity at room temperature). I can be flexible, but not too much, as the point of the project was to limit the model parameters to observed physical parameters.
This is all to prepare for an MCMC that adds Bayesian elements to the model. If my first guess is junk, so is the whole project.
To sum up, I would like to know why the errors are inconsistent and, if they come from nonidentifiability, whether improving the initial guess can fix that or whether I need to remove a variable.
Code for reference.
Objective function
init = function(param){
  # Solve for displacement of triglycerides
  T.mcmc2 = T.hat.isf
  T.mcmc2 = T.mcmc2 - min(T.mcmc2)
  A.mcmc2 = T.mcmc2
  A.mcmc2[1] <- (6*1.02*.200)/.250
  B.mcmc2 = T.mcmc2
  primes <- Bprime(x.fine1, param, T.mcmc2, A.mcmc2, B.mcmc2)
  B.mcmc2 <- as.numeric(unlist(primes[1]))
  A.mcmc2 <- as.numeric(unlist(primes[2]))
  res = t(B.obs - B.mcmc2[x.points]) %*% (B.obs - B.mcmc2[x.points])
  # print(res)
  return(res)
}
Optimization with parameters
l = c(1e-8,1e-8, 1e-8, 1e-8)
u = c(2,1.2,24,24)
th0=c(.052, 0.19, .50, 8)
op = optim(th0[1:3], init, method="L-BFGS-B", lower=l, upper = u)
Once run, this message often appears: "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH".
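That message is L-BFGS-B's normal way of reporting that it stopped because the relative reduction of the objective fell below factr times the machine epsilon. As a hedged sketch (not the poster's actual setup; the bounds are trimmed here to match the three free parameters), the tolerance can be tightened through optim's control list, and the returned Hessian can hint at nonidentifiability if it is near-singular:
op = optim(th0[1:3], init, method = "L-BFGS-B",
           lower = l[1:3], upper = u[1:3],  # first three bounds, matching th0[1:3]
           control = list(factr = 1e7),     # factr defaults to 1e7; smaller values are stricter
           hessian = TRUE)                  # inspect op$hessian for near-singularity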

How to use survey to analyze the American Housing Survey data using replicate weights

I'm analyzing data from the American Housing Survey, which ships with replicate weights to compute correct standard errors, in R with the survey package, but I want to make sure that I'm specifying the design correctly.
Here is how I do it:
svy <- svrepdesign(data = ahs,
                   weights = ~WEIGHT,
                   repweights = "REPWEIGHT[0-9]+",
                   type = "Fay",
                   rho = 0.5,
                   scale = 4/160,
                   rscales = rep(1, 160),
                   mse = TRUE)
I set rho to 0.5 because, in section 3.1 of the guide to using replicate weights published by the Census Bureau, where they explain how to compute standard errors with SAS (https://www.census.gov/content/dam/Census/programs-surveys/ahs/tech-documentation/2015/Quick%20Guide%20to%20Estimating%20Variance%20Using%20Replicate%20Weights%202009%20to%20Current.pdf), they say to use the option VARMETHOD=BRR(FAY) without specifying any other options, and according to the SAS documentation (http://support.sas.com/documentation/onlinedoc/stat/142/surveymeans.pdf) the default value of that coefficient is 0.5.
I set mse to TRUE because, in the formula they give for the standard error in section 4, the sum of squared deviations is calculated around the estimate of the statistic computed with the full sample weights.
Finally, I set scale to 4/160 and rscales to rep(1, 160) because, in that same formula, the sum of squared deviations is multiplied by 4/160 but there is no multiplier inside the sum operator.
However, when I look at Anthony Joseph Damico's webpage on the American Housing Survey (http://asdfree.com/american-housing-survey-ahs.html), he does this:
ahs_design <-
  svrepdesign(
    weights = ~wgt90geo,
    repweights = "repwgt[1-9]",
    type = "Fay",
    rho = (1 - 1 / sqrt(4)),
    mse = TRUE,
    data = ahs_df
  )
Leaving aside the names of the weight variables, which changed in 2015 (presumably after he wrote that page), he's doing the same as me except that he doesn't specify scale and rscales. Based on what I explained above and the documentation of survey, it seems to me that he should specify them as I did, but I've never used replicate weights with survey before, so I would like to make sure.
P.S. What I find even weirder is that, when I don't specify scale and rscales, the standard errors I compute seem to be the same as when I do. This means that it probably doesn't matter in practice how I do it, but since the formula used to compute the standard errors is supposed to be different when scale and rscales are specified, I would still like to understand why it doesn't seem to affect the standard errors computed by survey.
P.S. bis: Another thing I don't understand is that, even though the Census Bureau says it has used Fay's method and recommends a SAS procedure that results in a Fay coefficient of 0.5, there doesn't seem to be any Fay coefficient in the formula for the standard error given in the guide it published. This means that, if I were to write my own code to compute standard errors using that formula, the result would presumably differ from what I get with survey and a rho of 0.5, or with the SAS procedure recommended by the Census Bureau, which doesn't make a lot of sense to me.
svrepdesign doesn't need scale or rscales arguments for Fay replicate weights, because it can work them out by itself. That's the point of having known types of weights. I should probably add a warning for when you specify them anyway.
There doesn't need to be a Fay coefficient in the formula explicitly. When the weights were constructed, the sampling weights were multiplied by 2-rho or rho to get the replicate weights. That's all been done. Now all you need to know is how to scale the squared residuals. The Census Bureau formula (p. 6 of your link) has a multiplier of 4/160. That 4 is 1/(1-rho)^2 -- Anthony Damico's code has the reverse conversion, working out rho=0.5 from the 4.
Straightforward BRR would have a multiplier of 1/160 rather than 4/160.
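As a quick arithmetic check (my own sketch, not from the Census guide): with 160 replicates and a Fay coefficient of 0.5, the multiplier that survey works out on its own reduces to the 4/160 in the Census formula, which is why supplying scale and rscales by hand changes nothing.
R_reps <- 160                  # number of replicate weights
rho    <- 1 - 1 / sqrt(4)      # the reverse conversion in Damico's code: rho = 0.5
1 / ((1 - rho)^2 * R_reps)     # Fay multiplier = 0.025
4 / 160                        # the Census guide's multiplier, also 0.025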

Fitting Exponential Distribution to Task Duration Counts

In my dataset, I have ants that switch between one state (in this case a resting state) and all other states over a period of time. I am attempting to fit an exponential distribution to the number of times an ant rests for a given duration (for instance, the ant may rest for 5 seconds 10 times, or for 6 seconds 5 times, etc.). While subjectively this distribution of durations looks exponential, I can't fit a single-parameter exponential distribution (where the one parameter is the rate) to the data. Is this possible with my dataset, or do I need to use a two-parameter exponential distribution?
I am attempting to fit the data to the following equation (where lambda is rate):
lambda * exp(-lambda * x).
This, however, doesn't seem to be mathematically possible to fit to either the counts of my data or the probability density of my data. In R I attempt to fit the data with the following code:
fit = nls(newdata$x.counts ~ (b*exp(b*newdata$x.mids)),
          start = list(x.counts = 1, x.mids = 1, b = 1))
When I do this, though, I get the following message:
Error in parse(text= x, keep.source = FALSE):
<text>:2:0: unexpected end of input
1: ~
^
I believe I am getting this because it's mathematically impossible to fit this particular equation to my data. Am I correct, or is there a way to transform the data or alter the equation so I can make it fit? I can also make it fit with the equation lambda * exp(mu * x), where mu is another free parameter, but my goal is to keep the equation as simple as possible, so I would prefer the one-parameter version.
Here is the data, as I can't seem to find a way to attach it as a csv:
https://docs.google.com/spreadsheets/d/1euqdgHfHoDmQKXHrtOLcn5x5o81zY1sr9Kq6NCbisYE/edit?usp=sharing
First, you have a typo in your formula: you forgot the minus sign in
(b*exp(b*newdata$x.mids))
But this is not what is throwing the error. The start argument should be a list that initializes only the parameter b, not x.counts or x.mids.
So the correct version would be:
fit = nls(newdata$x.counts ~ b*exp(-b*newdata$x.mids), start = list(b = 1))
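For what it's worth, here is a self-contained sketch of that call on simulated data (not the ant durations; the rate of 0.4 is arbitrary). Fitting bin densities rather than raw counts keeps the one-parameter form lambda * exp(-lambda * x) on the right scale:
set.seed(1)
dur <- rexp(500, rate = 0.4)                  # simulated resting durations
h   <- hist(dur, breaks = 20, plot = FALSE)   # bin them as in the original workflow
df  <- data.frame(x.mids = h$mids, dens = h$density)
fit <- nls(dens ~ b * exp(-b * x.mids), data = df, start = list(b = 1))
coef(fit)                                     # roughly 0.4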

R: Using fitdistrplus to fit curve over histogram of discrete data

So I have this discrete set of data my_dat that I am trying to fit a curve to, so that I can generate random variables based on my_dat. I had great success using fitdistrplus on continuous data but get many errors when attempting to use it for discrete data.
Setup:
library(fitdistrplus)
my_dat <- c(2,5,3,3,3,1,1,2,4,6,
3,2,2,8,3,4,3,3,4,4,
2,1,5,3,1,2,2,4,3,4,
2,4,1,6,2,3,2,1,2,4,
5,1,2,3,2)
I take a look at the histogram of the data first:
hist(my_dat)
Since the data are discrete, I decided to try fitting a binomial or a negative binomial distribution, and this is where I run into trouble. Here I try to fit each:
fitNB3 <- fitdist(my_dat, discrete = T, distr = "nbinom" ) #NaNs Produced
fitB3 <- fitdist(my_dat, discrete = T, distr = "binom")
I receive two errors:
fitNB3 seems to run but notes that "NaNs Produced" - can anyone let me know why this is the case?
fitB3 doesn't run at all and gives me the error: "Error in start.arg.default(data10, distr = distname) : Unknown starting values for distribution binom." - can anyone point out why this won't work here? I am unclear about providing a starting value given that the data are discrete (I attempted to use start = 1 in the fitdist call but received another error: "Error in fitdist(my_dat, discrete = T, distr = "binom", start = 1) : the function mle failed to estimate the parameters, with the error code 100").
I've been spinning my wheels on this for a while, but I would appreciate any feedback regarding these errors.
Don't use hist on discrete data, because it doesn't do what you think it's doing.
Compare plot(table(my_dat)) with hist(my_dat)... and then ponder how many wrong impressions you've gotten doing this before. If you must use hist, make sure you specify the breaks, don't rely on defaults designed for continuous variables.
hist(my_dat)
lines(table(my_dat),col=4,lwd=6,lend=1)
Neither of your models can be suitable, as both of these distributions start from 0, not 1, and with values of the size you have, P(0) will not be negligibly small.
I don't get any errors fitting the negative binomial when I run your code.
The issue you had with fitting the binomial is that you need to supply starting values for the parameters, which are called size (n) and prob (p), so you'd need to say something like:
fitdist(my_dat, distr = "binom", start=list(size=15, prob=0.2))
However, you will then get a new problem! The optimizer assumes that the parameters are continuous and will fail on size.
On the other hand, this is probably a good thing, because with unknown n the MLE is not well behaved, particularly when p is small.
Typically, with the binomial it would be expected that you know n. In that case, estimation of p could be done as follows:
fitdist(my_dat, distr = "binom", fix.arg=list(size=20), start=list(prob=0.15))
However, with fixed n, maximum likelihood estimation is straightforward in any case -- you don't need an optimizer for that.
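For instance, with the size fixed at 20 as above, a small sketch of the closed-form estimate:
n_fixed <- 20
p_hat   <- mean(my_dat) / n_fixed   # MLE of p for Binomial(n, p) with n known
p_hat                               # about 0.148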
If you really don't know n, there are a number of better-behaved estimators than the MLE to be found, but that's outside the scope of this question.

How to fit a model with individual measurement errors in DiceKriging, or can it be done?

I have a set of 5 data points (x = 10, 20, 30, 40, 50, with corresponding response values y and noise given as the s.d. of y). These data are obtained from stochastic computer experiments.
How can I use DiceKriging in R to fit a kriging model for these data?
x <- seq(from=10, to=50, length=5)
y <- c(-0.071476,0.17683,0.19758,0.2642,0.4962)
noise <- c(0.009725,0.01432,0.03284, 0.1038, 0.1887)
Examples online with heterogeneous noise have coef.var, coef.trend and coef.theta pre-specified. It is unlikely that I can have a priori values for these.
I have referred to the answer here. However, other references suggest that adding the nugget parameter lambda is similar to adding homogeneous noise, which is not the same as "individual errors".
The use of km with noise is quite simple:
model <- km(~1, data.frame(x=x), y, noise.var = noise, covtype = "matern3_2")
However, your noise term makes the line search part of the L-BFGS algorithm fail. It may be because the noise is strongly correlated with y; when I run the following lines (with the fourth noise value changed), it works:
noise <- c(0.009725, 0.01432, 0.03284, 0.001, 0.1887)
model <- km(~1, data.frame(x=x), y, noise.var = noise, covtype = "matern3_2")
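Once the model is fitted, predictions follow in the usual way (a small sketch; the prediction grid is arbitrary):
grid <- data.frame(x = seq(10, 50, length.out = 100))
pred <- predict(model, newdata = grid, type = "UK")   # universal kriging predictor
# pred$mean gives the predicted mean; pred$lower95 and pred$upper95 give 95% bounds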
