full precision may not have been achieved in 'qbeta' - r

I am running R version 2.14.0 on a PC which uses Windows 7 Ultimate (Intel Core i5-2400 3GHz processor with 8.00GB RAM). Let me know if other specs are needed.
I am trying to simulate correlated beta distributed data. The method I am using is an extension of what is written in this paper:
http://onlinelibrary.wiley.com/doi/10.1002/asmb.901/pdf
Basically, I start by simulating multivariate normal data (using mvrnorm() function from MASS).
Then I use pnorm() to apply the probit transform to these data so that my new data vectors live on (0,1) and are still correlated as described above.
Then given these probit transformed data I apply the qbeta() function with certain shape1 and shape2 parameters, to get back correlated beta data with certain mean and dispersion properties.
I know other methods for generating correlated beta data exist. I am interested in why qbeta() causes this method to fail for certain "seeds". Below is the error message I get.
Warning message:
In qbeta(probit_y0, shape1 = a0, shape2 = b0) :
full precision may not have been achieved in 'qbeta'
What does this mean? How can it be avoided? When it does occur within the context of a larger simulation, what is the best way to ensure that this problem does not terminate the entire sourced (using source()) simulation code?
I ran the following code for integer seeds from 1:1000. Seed = 899 was the only value that gave me problems, though if it's problematic here, it will inevitably be problematic for other seeds too.
library(MASS)
set.seed(899)
n0 <- 25
n1 <- 25
a0 <- 0.25
b0 <- 4.75
a1 <- 0.25
b1 <- 4.75
varcov_mat <- matrix(rep(0.25,n0*n0),ncol=n0)
diag(varcov_mat) <- 1
y0 <- mvrnorm(1,mu=rep(0,n0),Sigma=varcov_mat)
y1 <- mvrnorm(1,mu=rep(0,n1),Sigma=varcov_mat)
probit_y0 <- pnorm(y0)
probit_y1 <- pnorm(y1)
beta_y0 <- qbeta(probit_y0, shape1=a0, shape2=b0)
beta_y1 <- qbeta(probit_y1, shape1=a1, shape2=b1)
The above code is a fragment of a larger simulation project, but the qbeta() warning message is what is giving me a headache now.
Any help the group could provide would be greatly appreciated.
Cheers
Chris

The reason for the warning is that the algorithm used to calculate qbeta did not converge for those values of the parameters.
R uses AS 109 to calculate qbeta (Cran, G. W., K. J. Martin and G. E. Thomas (1977). Remark AS R19 and Algorithm AS 109, Applied Statistics, 26, 111–114, and subsequent remarks (AS83 and correction)). R tries to calculate the value in 1000 iterations; if it cannot do so within those 1000 iterations, you get the warning you saw.
Here is the code for qbeta.
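As for keeping a larger sourced simulation running: the warning by itself does not stop source(), but if you want to detect and log it rather than let it pile up, a minimal sketch (the wrapper name is illustrative, not part of the original code) is to trap it with withCallingHandlers():
# Sketch: wrap qbeta() so the precision warning is logged and muffled,
# letting a long sourced simulation note the problem and carry on.
safe_qbeta <- function(p, shape1, shape2) {
  withCallingHandlers(
    qbeta(p, shape1 = shape1, shape2 = shape2),
    warning = function(w) {
      if (grepl("full precision", conditionMessage(w))) {
        message("qbeta precision warning caught; continuing")
        invokeRestart("muffleWarning")
      }
    }
  )
}

beta_y0 <- safe_qbeta(probit_y0, a0, b0)
beta_y1 <- safe_qbeta(probit_y1, a1, b1)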

Related

What does the summary function do to the output of regsubsets?

Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which command produces the output of the regsubsets function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes the AIC. But here's the problem: while the typed-up example works fine, I run into a mismatch in the number of elements when I try to adapt this code to other data. For example:
require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp+gas+pack+
I(temp^2)+I(gas^2)+I(pack^2)+
I(temp*gas)+I(temp*pack)+I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.
Ok, so you know the help page for regsubsets says
regsubsets returns an object of class "regsubsets" containing no
user-serviceable parts. It is designed to be processed by
summary.regsubsets.
You're about to find out why.
The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.
The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroutine SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection, fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran, so it's not documented.
The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.
All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
    for (j in 1:nshow) {
        vr <- ll$ress[i, j]/ll$nullrss
        rssvec <- c(rssvec, ll$ress[i, j])
        rsqvec <- c(rsqvec, 1 - vr)
        adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1 + ll$intercept - i))
        cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1 + ll$intercept - 2 * i))
        bicvec <- c(bicvec, (n1 + ll$intercept) * log(vr) + i * log(n1 + ll$intercept))
    }
}
cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.
regsubsets has an nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.
Please note, though, that the 50 in your AIC formula is the sample size (i.e. 50 states in statedata). For your second example it should be nrow(odor), i.e. 15.
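Putting those two points together, the second example could be written roughly like this (a sketch: nvmax = 9 lets all nine terms enter, and the sample size comes from the data instead of being hard-coded):
require(faraway)
require(leaps)
data(odor, package = "faraway")
b <- regsubsets(odor ~ temp + gas + pack +
                  I(temp^2) + I(gas^2) + I(pack^2) +
                  I(temp*gas) + I(temp*pack) + I(gas*pack),
                data = odor, nvmax = 9)      # allow subsets up to all 9 terms
rs <- summary(b)
n <- nrow(odor)                              # 15 observations, not 50
AIC <- n * log(rs$rss / n) + (2:10) * 2      # 2:10 parameters, intercept included
plot(AIC ~ I(1:9), ylab = "AIC", xlab = "Number of Predictors")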

MLE using nlminb in R - understand/debug certain errors

This is my first question here, so I will try to make it as well written as possible. Please bear with me should I make a silly mistake.
Briefly, I am trying to do a maximum likelihood estimation where I need to estimate 5 parameters. The general form of the problem I want to solve is as follows: A weighted average of three copulas, each with one parameter to be estimated, where the weights are nonnegative and sum to 1 and also need to be estimated.
There are packages in R for doing MLE on single copulas or on a weighted average of copulas with fixed weights. However, to the best of my knowledge, no packages exist to directly solve the problem I outlined above. Therefore I am trying to code the problem myself. There is one particular type of error I am having trouble tracing to its source. Below I have tried to give a minimal reproducible example where only one parameter needs to be estimated.
library(copula)
set.seed(150)
x <- rCopula(100, claytonCopula(250))
# Copula density
clayton_density <- function(x, theta) {
  dCopula(x, claytonCopula(theta))
}

# Negative log-likelihood function
nll.clayton <- function(theta) {
  theta_trans <- -1 + exp(theta)  # admissible theta values for Clayton copula
  nll <- -sum(log(clayton_density(x, theta_trans)))
  return(nll)
}

# Initial guess for optimization
guess <- function(x) {
  init <- rep(NA, 1)
  tau.n <- cor(x[, 1], x[, 2], method = "kendall")
  # Guess using method of moments
  itau <- iTau(claytonCopula(), tau = tau.n)
  # In case itau is negative, we need a conditional statement
  # Use log because it is (almost) the inverse of the theta transformation above
  if (itau <= 0) {
    init[1] <- log(0.1)  # Ensures positive initial guess
  } else {
    init[1] <- log(itau)
  }
  init  # return the initial guess explicitly
}
estimate <- nlminb(guess(x), nll.clayton)
(parameter <- -1 + exp(estimate$par)) # Retrieve estimated parameter
fitCopula(claytonCopula(), x) # Compare with fitCopula function
This works great when simulating data with small values of the copula parameter, and gives almost exactly the same answer as fitCopula() every time.
For large values of the copula parameter, such as 250, the following error shows up when I run the line with nlminb():
Error in .local(u, copula, log, ...) : parameter is NA
Called from: .local(u, copula, log, ...)
Error during wrapup: unimplemented type (29) in 'eval'
When I run fitCopula(), the optimization finishes, but this message pops up:
Warning message:
In dlogcdtheta(copula, u) :
dlogcdtheta() returned NaN in column(s) 1 for this explicit copula; falling back to numeric derivative for those columns
Using debug(), I have been able to find out that somewhere in the optimization process of nlminb(), the parameter of interest is assigned the value NaN, which then yields this error when dCopula() is called. However, I do not know at which iteration this happens, or what nlminb() is doing when it happens. I suspect that at some iteration the objective function is evaluated at Inf/-Inf, but I do not know what nlminb() does next. Something similar seems to happen with fitCopula(), but there the optimization is still carried out to the end, only with the above-mentioned warning.
I would really appreciate any help in understanding what is going on, how I might debug it myself and/or how I can deal with the problem. As might be evident from the question, I do not have a strong background in coding. Thank you so much in advance to anyone that takes the time to consider this problem.
Update:
When I run dCopula(x, claytonCopula(-1+exp(guess(x)))) or, equivalently, clayton_density(x, -1+exp(guess(x))), it becomes apparent that the density evaluates to 0 at several data points. Unfortunately, creating pseudo-observations with x <- pobs(x) does not solve the problem, which can be seen by repeating dCopula(x, claytonCopula(-1+exp(guess(x)))). The result is that applying the logarithm gives several -Inf evaluations, which of course means the whole negative log-likelihood evaluates to Inf, as can be seen by running nll.clayton(guess(x)). Hence, in addition to the above queries, any tips on handling log(0) when doing MLE numerically are welcome and appreciated.
Second update
Editing the second line in nll.clayton as follows seems to work okay:
nll <- -sum(log(clayton_density(x, theta_trans) + 1e-8))
However, I do not know whether this is a "good" way to circumvent the problem, in the sense that it does not introduce the potential for large errors (though it would surprise me if it did).
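An alternative sketch (under the assumption that you mainly need to keep the optimiser away from regions where the density underflows) is to ask dCopula() for the log-density directly via its log argument and return a large finite penalty whenever the result is not finite:
# Sketch: evaluate the Clayton log-density directly and penalise non-finite
# values so nlminb() backs away instead of propagating NaN/Inf.
nll.clayton <- function(theta) {
  theta_trans <- -1 + exp(theta)                 # admissible Clayton parameter
  ll <- dCopula(x, claytonCopula(theta_trans), log = TRUE)
  if (any(!is.finite(ll))) return(1e10)          # large penalty instead of Inf
  -sum(ll)
}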

R - linear model does not match experimental data

I am trying to perform a linear regression on experimental data consisting of replicate measures of the same condition (for several conditions) to check for the reliability of the experimental data. For each condition I have ~5k-10k observations stored in a data frame df:
[1]    cond1 repA    cond1 repB   cond2 repA   cond2 repB   ...
[2]    4.158660e+06  4454400.703  ...
[3]    1.458585e+06  4454400.703  ...
[4]    NA            887776.392   ...
...
[5024] 9571785.382   9.679092e+06 ...
I use the following code to plot scatterplot + lm + R^2 values (stored in rdata) for the different conditions:
for (i in seq(1, 13, 2)) {
  vec <- matrix(0, nrow = nrow(df), ncol = 2)
  vec[, 1] <- df[, i]
  vec[, 2] <- df[, i+1]
  vec <- na.exclude(vec)
  plot(log10(vec[, 1]), log10(vec[, 2]), xlab = 'rep A', ylab = 'rep B', col = "#00000033")
  abline(fit <- lm(log10(vec[, 2]) ~ log10(vec[, 1])), col = 'red')
  legend("topleft", bty = "n",
         legend = paste("R2 is", rdata[1, ((i+1)/2)] <- format(summary(fit)$adj.r.squared, digits = 4)))
}
However, the fitted line seems to be shifted so that it does not follow the trend I see in the experimental data.
This occurs consistently for every condition. I unsuccessfully tried to find an explanation by looking up the source code and browsing different forums and posts (this or here).
I would have liked to simply comment and ask a few questions, but I can't.
From what I've understood, both repA and repB are measured with error. Hence, you cannot fit your data using an ordinary least squares procedure, which only takes into account the error in Y (some might argue a weighted OLS may work, but I'm not skilled enough to discuss that). Your question seems linked to this one.
What you can use is a total least squares procedure: it takes into account the error in both X and Y. In the example below, I've used a "normal" TLS, assuming there is the same error in X and Y (thus error.ratio=1). If that is not the case, you can specify the error ratio by entering error.ratio=var(y1)/var(x1) (at least I think it's var(Y)/var(X): check the documentation to make sure).
library(mcr)
# x1, y1: the two replicate measurement vectors (e.g. rep A and rep B for one condition)
MCR_reg=mcreg(x1,y1,method.reg="Deming",error.ratio=1,method.ci="analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values=MCResultAnalytical.calcResponse(MCR_reg,x_to_predict,alpha=0.05)
CI_low=predicted_values[,4]
CI_up=predicted_values[,5]
Please note that in Deming/TLS regressions your x- and y-errors are supposed to follow a normal distribution, as explained here. If that is not the case, go for a Passing-Bablok regression (the R code is here).
Also note that R^2 isn't defined for Deming or Passing-Bablok regressions (see here). A correlation coefficient is a good proxy, although it does not provide exactly the same information. Since you're studying a linear correlation between two factors, see Pearson's product-moment correlation coefficient, and use e.g. the rcorr function.
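For that correlation coefficient, a minimal sketch using base R (assuming vec holds the two replicate columns for one condition, as in your loop) would be:
# Pearson correlation of the log10-transformed replicates as a proxy for R^2
r_pearson <- cor(log10(vec[, 1]), log10(vec[, 2]),
                 use = "complete.obs", method = "pearson")
r_pearson^2   # a squared correlation, if an R^2-like number is wanted
# cor.test(log10(vec[, 1]), log10(vec[, 2])) would additionally give a p-value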

The curious case of ARIMA modelling using R

I observed something strange while fitting an ARMA model using the function arma{tseries} and arima{stats} in R.
There is a radical difference in the estimation procedures adopted by the two functions, Kalman filter in arima{stats} as opposed to ML estimation in arma{tseries}.
Given the difference in the estimation procedures between the two functions, one would not expect the results to be radically different for the two functions if we use the same time series.
Well seems that they can!
Generate the time series below and add 2 outliers.
set.seed(1010)
ts.sim <- abs(arima.sim(list(order = c(1,0,0), ar = 0.7), n = 50))
ts.sim[8] <- ts.sim[12]*8
ts.sim[35] <- ts.sim[32]*8
Fit an ARMA model using the two functions.
# Works perfectly fine
arima(ts.sim, order = c(1,0,0))
# Works perfectly fine
arma(ts.sim, order = c(1,0))
Change the level of the time series by a factor of 1 billion:
# Introduce a multiplicative shift
ts.sim.1 <- ts.sim*1000000000
options(scipen = 999)
summary(ts.sim.1)
Fit an ARMA model using the 2 functions:
# Works perfectly fine
arma(ts.sim.1, order = c(1,0))
# Does not work
arima(ts.sim.1, order = c(1,0,0))
## Error in solve.default(res$hessian * n.used, A): system is
## computationally singular: reciprocal condition number = 1.90892e-19
I first noticed this problem when SAS was able to run the proc x12 procedure to conduct the seasonality test, while the same step in R gave me the error above. That made me look at the SAS results with skepticism, but it turns out it might just be something to do with arima{stats}.
Can anyone elaborate on the reason for the above error, which prevents us from fitting a model using arima{stats}?
The problem arises in the stats::arima function when calculating the covariance matrix of the coefficients. The code is not very robust to scale effects due to large numbers, and crashes in computing the inverse of the Hessian matrix in this line:
var <- crossprod(A, solve(res$hessian * n.used, A))
The problem is avoided by simply scaling the data. For example
arima(ts.sim.1/100, order = c(1,0,0))
will work.
The tseries::arma function does not work "perfectly fine" though. It returns a warning message:
In arma(ts.sim.1, order = c(1, 0)) : Hessian negative-semidefinite
This can also be avoided by scaling:
arma(ts.sim.1/1000, order = c(1,0))
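As a quick sanity check, the AR coefficient should be essentially unchanged by the rescaling; only the intercept (and the innovation variance) absorbs the scale factor:
# The AR(1) coefficient is scale-invariant, so the two fits agree on the dynamics
coef(arima(ts.sim, order = c(1, 0, 0)))
coef(arima(ts.sim.1 / 100, order = c(1, 0, 0)))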

Trying to do a simulation in R

I'm pretty new to R, so I hope you can help me!
I'm trying to do a simulation for my Bachelor's thesis, where I want to simulate how a stock evolves.
I've done the simulation in Excel, but the problem is that I can't make that large of a simulation, as the program crashes! Therefore I'm trying in R.
The stock evolves as follows (everything except $\epsilon$ consists of constants which are known):
$$W_{t+\Delta t} = W_t \, e^{r \Delta t}\left(1+\pi\left(\exp\left((\sigma \lambda -0.5\sigma^2) \Delta t+\sigma \epsilon_{t+\Delta t} \sqrt{\Delta t}-1\right)\right)\right)$$
The only stochastic element here is $\epsilon$, which is drawn from a standard normal distribution, N(0,1), and plays the role of the Brownian-motion increment.
What I've done in Excel:
I made 100 samples, each of size 40. All of these samples are standard normally distributed, N(0,1).
These outcomes are then used to calculate how the stock is affected (the normal draws represent the shocks from the economy).
My problem in R:
I've used the sample function:
x <- sample(norm(0,1), 1000, T)
So I have 1000 samples, which are normally distributed. Now I don't know how to put these results into the formula I have for the evolution of my stock. Can anyone help?
Using R for (discrete) simulation
There are two aspects to your question: conceptual and coding.
Let's deal with the conceptual first, starting with the meaning of your equation:
1. Conceptual issues
The first thing to note is that your evolution equation is continuous in time, so running your simulation as described above means accepting a discretisation of the problem. Whether or not that is appropriate depends on your model and how you have obtained the evolution equation.
If you do run a discrete simulation, then the key decision you have to make is what stepsize $\Delta t$ you will use. You can explore different step-sizes to observe the effect of step-size, or you can proceed analytically and attempt to derive an appropriate step-size.
Once you have your step-size, your simulation consists of pulling new shocks (samples of your standard normal distribution), and evolving the equation iteratively until the desired time has elapsed. The final state $W_t$ is then available for you to analyse however you wish. (If you retain all of the $W_t$, you have a distribution of the trajectory of the system as well, which you can analyse.)
So:
your $x$ is a sample of your shocks, i.e. it is $\epsilon_{t=0}$.
To simulate the evolution of the $W_t$, you will need some initial condition $W_0$. What this is depends on what you're modelling. If you're modelling the likely values of a single stock starting at an initial price $W_0$, then your initial state is a 1000 element vector with constant value.
Now evaluate your equation, plugging in all your constants, $W_0$, and your initial shocks $\epsilon_0 = x$ to get the distribution of prices $W_1$.
Repeat: sample $x$ again -- this is now $\epsilon_1$. Plugging this in, gives you $W_2$ etc.
2. Coding the simulation (simple example)
One of the useful features of R is that most operators work element-wise over vectors.
So you can pretty much type in your equation more or less as it is.
I've made a few assumptions about the parameters in your equation, and I've ignored the $\pi$ function -- you can add that in later.
So you end up with code that looks something like this:
dt <- 0.5            # step-size
r <- 1               # parameters
lambda <- 1
sigma <- 1           # std deviation
w0 <- rep(1, 1000)   # presumed initial condition -- prices start at 1

# Show an example iteration -- incorporate into one line for production code...
x <- rnorm(1000, mean = 0, sd = 1)   # random shock
w1 <- w0 * exp(r * dt) *
  (1 + exp((sigma * lambda - 0.5 * sigma^2) * dt + sigma * x * sqrt(dt) - 1))   # evolution
When you're ready to let the simulation run, then merge the last two lines, i.e. include the sampling statement in the evolution statement. You then get one line of code which you can run manually or embed into a loop, along with any other analysis you want to run.
# General simulation step
w <- w * exp(r * dt) *
  (1 + exp((sigma * lambda - 0.5 * sigma^2) * dt +
             sigma * rnorm(1000, mean = 0, sd = 1) * sqrt(dt) - 1))
You can also easily visualise the changes and obtain summary statistics:
hist(w)
summary(w)
Of course, you'll still need to work through the details of what you actually want to model and how you want to go about analysing it --- and you've got the $\pi$ function to deal with --- but this should get you started toward using R for discrete simulation.
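For example, a minimal sketch of such a loop (the horizon of 40 steps is an arbitrary choice) could be:
# Sketch: iterate the evolution for nsteps time steps, storing the whole
# trajectory so the distribution of W at any time can be inspected later.
nsteps <- 40
W <- matrix(NA_real_, nrow = nsteps + 1, ncol = 1000)
W[1, ] <- w0                                  # initial prices
for (k in 1:nsteps) {
  eps <- rnorm(1000, mean = 0, sd = 1)        # shocks for this step
  W[k + 1, ] <- W[k, ] * exp(r * dt) *
    (1 + exp((sigma * lambda - 0.5 * sigma^2) * dt + sigma * eps * sqrt(dt) - 1))
}
hist(W[nsteps + 1, ])                         # distribution at the final time
summary(W[nsteps + 1, ])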
