How to solve ODEs with an internal threshold in R?

I have the following function containing some odes:
myfunction <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    if (X > 20) {  # this is an internal threshold!
      Y  <- 35000
      dY <- 0
    } else {
      dY <- b * (Y - Z)
    }
    dX <- a * X^6 + Y * Z
    dZ <- -X * Y + c * Y - Z
    # return the rate of change
    list(c(dX, dY, dZ), Y, dY)
  })
}
Here are some results:
library(deSolve)
parameters <- c(a = -8/3, b = -10, c = 28)
state <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 10, by = 0.1)
out <- ode(y = state, times = times, func = myfunction, parms = parameters)
out
time X Y Z Y dY
1 0.0 1.000000 1.000000 1.000000 1.000000 0.00000
2 0.1 1.104670 2.132728 4.470145 2.132728 23.37417
3 0.2 1.783117 6.598806 14.086158 6.598806 74.87351
4 0.3 2.620428 20.325966 42.957134 20.325966 226.31169
5 0.4 3.775597 60.969424 126.920014 60.969424 659.50590
6 0.5 5.358823 176.094907 358.726482 176.094907 1826.31575
7 0.6 7.460841 482.506706 953.270570 482.506706 4707.63864
8 0.7 10.122371 1230.831764 2330.599161 1230.831764 10997.67398
9 0.8 13.279052 2859.284114 5113.458479 2859.284114 22541.74365
10 0.9 16.711405 5912.675147 9823.406760 5912.675147 39107.31613
11 1.0 24.452867 10590.600567 16288.435139 35000.000000 0.00000
12 1.1 25.988924 10590.600567 23476.343542 35000.000000 0.00000
13 1.2 26.572411 10590.600567 26821.703961 35000.000000 0.00000
14 1.3 26.844240 10590.600567 28510.668725 35000.000000 0.00000
15 1.4 26.980647 10590.600567 29391.032472 35000.000000 0.00000
...
The two Y columns are different; can anybody explain to me why?
I believe I haven't set my threshold correctly.
Is there a way to do that?
Thanks!

Think of the simplest method for solving ODEs, the Euler method:
state = state + myfunction(t, state, parameters) * h
f(t+h) = f(t) + f'(t) * h
Here h is a small time step and myfunction plays the role of the derivative f'(t). It only evaluates the derivative; it has no access to the actual state, and in particular not to Y. Both are handled internally by ode using a method which is, in principle, similar to Euler's: given the numerical values of f(t), f'(t) and h, it simply updates the state to f(t+h).
So the threshold branch adjusts dY but cannot touch state["Y"]. It only manipulates a local variable, which is evaluated as 35000 in dX <- a*X^6 + Y*Z and dZ <- -X*Y + c*Y - Z, but the actual state["Y"] is overwritten by ode's own update after myfunction has returned.
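To make this concrete, here is a minimal hand-rolled Euler loop (an illustration of the mechanics only, not what deSolve actually does internally, and not numerically accurate for this system): the solver owns the state vector and only ever asks myfunction for rates.
h     <- 0.01
state <- c(X = 1, Y = 1, Z = 1)
for (t in seq(0, 0.1, by = h)) {
  derivs <- myfunction(t, state, parameters)[[1]]  # only the rates come back
  state  <- state + derivs * h                     # the solver overwrites the state
}
state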
I am afraid that I cannot think of a simple way to bypass this design. I would just use the extra output column, out[, 5].


Inner-loop behavior of a parallelized structure using R

I want to understand how my parallelization works when there is a for-loop inside the structure that I am parallelizing.
I have a routine called reg_simulation(), which generates 100 estimates (nrep = 100) of a linear regression, each of them using a different seed (seed <- seed + i).
Additionally, I wrapped the reg_simulation() routine inside par_wrapper() to run it under different configurations of the data generating process, in particular varying the number of observations (obs) and the error term variance (sigma). Finally, I parallelized this structure using pblapply.
With this setup I use a grid of obs = c(250, 500, 750, 1000, 2500) and sigma = c(0.1, 0.2, 0.5, 0.8, 1), i.e. 5 values for each variable and therefore 25 combinations, and each of those 25 combinations is run 100 times.
Finally, here is my question:
My code is...
(a) Running in parallel 25 combinations but serially the 100 repetition inside of them.
(b) Running in parallel all the 2500 models.
If the answer is (a), please let me know how you arrived at that conclusion, because I haven't been able to sort it out myself, and it probably means I should change my code structure.
Some additional comments: (1) the seed declaration on each iteration is important because it allows me to reproduce each individual dataset (e.g., iteration 78 (seed = 78), with sigma = 0.1 and obs = 1000); (2) I am using pblapply because I want to track the progress of my simulations.
Here the aforementioned routines:
reg_simulation()
reg_simulation <- function(obs = 1000,
                           sigma = 0.5,
                           nrep = 10,
                           seed = 0) {
  res <- vector("list", nrep)
  # For-loop over repetitions
  for (i in 1:nrep) {
    # Change the seed on each iteration
    seed <- seed + i
    set.seed(seed)
    # DGP
    x1 <- rnorm(obs, 0, sigma)
    x2 <- rnorm(obs, 0, sigma)
    y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
    # Estimate OLS
    ols <- lm(y ~ x1 + x2)
    returnlist <- list(intercept = ols$coefficients[1],
                       beta1 = ols$coefficients[2],
                       beta2 = ols$coefficients[3],
                       seed = seed)
    # Save each iteration
    res[[i]] <- returnlist
  }
  return(res)
}
par_wrapper()
### parallel wrapper
par_wrapper <- function(obs = c(250, 500, 750, 1000, 2500),
                        sigma = c(0.1, 0.2, 0.5, 0.8, 1),
                        nrep = 10,
                        nClusters = 4) {
  require(parallel)
  require(pbapply)
  # Grid of the search space
  prs <- expand.grid(obs = obs,
                     sigma = sigma)
  nprs <- nrow(prs)
  rownames(prs) <- c(1:NROW(prs))
  # Print the combinations
  print(prs)
  #### ---- PARALLEL INIT ---- ####
  ## Parallel options
  cl <- makeCluster(nClusters)
  ## Attach the functions needed for the internal computations
  parallel::clusterExport(cl = cl,
                          list("reg_simulation"))
  # pblapply over the rows of the parameter grid
  par_simres <- pblapply(cl = cl,
                         X = 1:nprs,
                         FUN = function(i) {
                           reg_simulation(
                             sigma = prs$sigma[i],
                             obs = prs$obs[i],
                             nrep = nrep,
                             seed = 0)
                         })
  ## Exit cluster mode
  stopCluster(cl)
  return(par_simres)
}
Using the par_wrapper() function over a grid.
#using generated structure.
res_list <- par_wrapper(
obs = c(250,500,750,1000, 2500 ),
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ) ,
nrep = 100,
nClusters = 4)
Console output.
# obs sigma
# 1 250 0.1
# 2 500 0.1
# 3 750 0.1
# 4 1000 0.1
# 5 2500 0.1
# 6 250 0.2
# 7 500 0.2
# 8 750 0.2
# 9 1000 0.2
# 10 2500 0.2
# 11 250 0.5
# 12 500 0.5
# 13 750 0.5
# 14 1000 0.5
# 15 2500 0.5
# 16 250 0.8
# 17 500 0.8
# 18 750 0.8
# 19 1000 0.8
# 20 2500 0.8
# 21 250 1.0
# 22 500 1.0
# 23 750 1.0
# 24 1000 1.0
# 25 2500 1.0
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s

Using nparcomp with user-defined contrasts

I would like to test the hypothesis H0: a2 - a1 = b2 - b1 on the following data:
data.csv
a1,a2,b1,b2
0.439,0.066,0.0,0.001
0.451,0.07,0.0,0.0
0.446,0.06,0.0,0.0
0.34,0.056,0.0,0.0
0.294,0.008,0.0,0.0
0.284,0.002,0.001,0.0
1.0,1.0,0.002,0.0
First, I tried ANOVA:
library(car)
data = read.csv('data.csv')
labels = factor(rep(c('a1','a2','b1','b2'),
c(nrow(data['a1']), nrow(data['a2']), nrow(data['b1']), nrow(data['b2']))))
x = C(labels, c(-1, 1, 1, -1), 1)
y = c(data[['a1']], data[['a2']], data[['b1']], data[['b2']])
l = lm(y ~ x)
a = Anova(l)
print(a$`Pr(>F)`)
which gives p = 0.1891837.
However, since I cannot assume that the data comes from a normal distribution, I would like to use a non-parametric test instead. I tried using nparcomp with a user-defined contrast matrix:
library(nparcomp)
data = read.csv('data.csv')
x = factor(rep(c('a1','a2','b1','b2'),
c(nrow(data['a1']), nrow(data['a2']), nrow(data['b1']), nrow(data['b2']))))
y = c(data[['a1']], data[['a2']], data[['b1']], data[['b2']])
nparcomp(y ~ x, data=data.frame(x, y), type="UserDefined", contrast.matrix=t(matrix(c(-1, 1, 1, -1))))
This failed with the error:
Error in nparcomp(y ~ x, data = data.frame(x, y), type =
"UserDefined", : Sums of positive contrast coefficients must be 1!
which I fixed by rescaling the contrast matrix to t(matrix(c(-0.5, 0.5, 0.5, -0.5))). However, this failed with another error:
Error in logit.dev %*% cov.bf : non-conformable arguments
What is the correct way to use nparcomp with user-defined contrasts?
P.S. I am a newcomer to R, so apologies for the bad coding style.
Let me preface this by saying that I'm not familiar with this particular area of statistics at all, so although my code does run without errors, my modification may have completely changed the design of what you are trying to test - let me know if this is the case. Anyway, I looked through the definition of nparcomp to see what was causing the first error, Sums of positive contrast coefficients must be 1!:
if (type == "UserDefined") {
  if (is.null(contrast.matrix)) {
    stop("Please eanter a contrast matrix!")
  }
  Con <- contrast.matrix
  rownames(Con) <- paste("C", 1:nrow(Con))
  for (rc in 1:nrow(Con)) {
    if (sum(Con[rc, ][Con[rc, ] > 0]) != 1) {
      stop("Sums of positive contrast coefficients must be 1!")
    }
  }
  colnames(Con) <- fl
}
In the case of your original contrast.matrix, t(matrix(c(-1, 1, 1, -1))), the sum of the positive coefficients evaluates to 2; your second attempt, t(matrix(c(-0.5, 0.5, 0.5, -0.5))), evaluates to 1 and so passes the check in the above block of code (which is why it doesn't produce this error).
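You can reproduce the package's check directly on both contrast vectors (a quick sketch of the test quoted above):
cm1 <- t(matrix(c(-1, 1, 1, -1)))
cm2 <- t(matrix(c(-0.5, 0.5, 0.5, -0.5)))
sum(cm1[1, ][cm1[1, ] > 0])  # 2 -> triggers the error
sum(cm2[1, ][cm2[1, ] > 0])  # 1 -> passes the check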
Your second error, Error in logit.dev %*% cov.bf : non-conformable arguments, seems to be arising from trying to multiply matrices of incompatible dimensions. I didn't trace the root of this particular error (the body of this function is pretty lengthy), but it gave me the idea to try passing t(matrix(c(-0.5, 0.5, 0.5, -0.5))) as a symmetric matrix:
contrVec <- t(matrix(c(-1, 1, 1, -1)))
contrMat <- -1*(matrix(
outer(contrVec,
t(contrVec)),
ncol=4)/2)
##
> contrMat
[,1] [,2] [,3] [,4]
[1,] -0.5 0.5 0.5 -0.5
[2,] 0.5 -0.5 -0.5 0.5
[3,] 0.5 -0.5 -0.5 0.5
[4,] -0.5 0.5 0.5 -0.5
Passing this to contrast.matrix does not produce any errors, but as I've said, I am unfamiliar with this particular procedure / function, so I may have completely changed the hypothesis you are trying to test. Here's the full code:
library(car)
library(nparcomp)
##
data <- data.frame(
a1=c(.439,.451,.446,
.340,.294,.284,1.00),
a2=c(.066,.07,.06,.056,
.008,.002,1.00),
b1=c(rep(0.0,5),
.001,.002),
b2=c(.001,
rep(0.0,6)))
##
x <- factor(
rep(names(data),
each=nrow(data)))
##
y <- c(data[['a1']], data[['a2']],
data[['b1']], data[['b2']])
##
contrVec <- t(matrix(c(-1, 1, 1, -1)))
contrMat <- -1*(matrix(
outer(contrVec,
t(contrVec)),
ncol=4)/2)
##
nprce <- nparcomp(
y ~ x,
data=data.frame(x, y),
type="UserDefined",
contrast.matrix=contrMat)
##
> summary(nprce)
#------------Nonparametric Multiple Comparisons for relative contrast effects----------#
- Alternative Hypothesis: True relative contrast effect p is less or equal than 1/2
- Estimation Method: Global Pseudo ranks
- Type of Contrast : UserDefined
- Confidence Level: 95 %
- Method = Logit - Transformation
- Estimation Method: Pairwise rankings
#---------------------------Interpretation--------------------------------------------#
p(a,b) > 1/2 : b tends to be larger than a
#-------------------------------------------------------------------------------------#
#----Data Info-------------------------------------------------------------------------#
Sample Size
1 a1 7
2 a2 7
3 b1 7
4 b2 7
#----Contrast--------------------------------------------------------------------------#
a1 a2 b1 b2
C 1 -0.5 0.5 0.5 -0.5
C 2 0.5 -0.5 -0.5 0.5
C 3 0.5 -0.5 -0.5 0.5
C 4 -0.5 0.5 0.5 -0.5
#----Analysis--------------------------------------------------------------------------#
Comparison Estimator Lower Upper Statistic p.Value
1 C 1 0.429 0.345 0.517 -1.593593 0.1110273
2 C 2 0.571 0.483 0.655 1.593593 0.1110273
3 C 3 0.571 0.483 0.655 1.593593 0.1110273
4 C 4 0.429 0.345 0.517 -1.593593 0.1110273
#----Overall---------------------------------------------------------------------------#
Quantile p.Value
1 1.959966 0.1110273
#--------------------------------------------------------------------------------------#

Passing Fixed and Variable parameters to Optimx

This is a syntax question and probably has a simple solution but I can't find it covered anywhere on SO for the optimx package.
Minimal working example & Question
I have a function like:
ToOptimise = function(a, b, d, e){
  (a-1)^2 + (b-2)^2 + (d-3)^2 + (e-4)^2
}
I can optimise select parameters in mle2 quite easily:
library(bbmle)
Calib2 = mle2(ToOptimise,
              start = list(a = 1, d = 10),
              fixed = list(b = 2, e = 2))
This is not really the right tool though as it is not a maximum likelihood problem.
I want to find a way to pass fixed and optimisable parameters to optimx just as easily.
I plan to do several calibrations holding different parameters fixed each time, so I do not want to rewrite the function with some parameters hard-coded.
My attempt that doesn't work
library(optimx)
ToOptimiseVector = function(Theta){
  a = Theta[1]
  b = Theta[2]
  d = Theta[3]
  e = Theta[4]
  (a-1)^2 + (b-2)^2 + (d-3)^2 + (e-4)^2
}
Calib1 = optimx(par = c(1,2,1,2),
                fn = ToOptimiseVector,
                Theta[2] = 2, Theta[4] = 2)
Another related Question on Stack Overflow
Another optimx syntax question that does not look at fixed and optimisable parameters is:
R- Optimx for exponential function with 2 parameters - cannot evaluate function at initial parameter values
I added an opt argument to the fn. Just pass this opt vector into optimx(); any non-NA values become fixed.
ToOptimiseVector <- function(Theta, opt = rep(NA, 4)){
  # Check if any of opt is not NA
  if (any(!sapply(opt, is.na))) {
    i <- !sapply(opt, is.na)
    # Fix the non-NA values
    Theta[i] <- opt[i]
  }
  a <- Theta[1]
  b <- Theta[2]
  d <- Theta[3]
  e <- Theta[4]
  return((a-1)^2 + (b-2)^2 + (d-3)^2 + (e-4)^2)
}
Seems to work.
Calib1 = optimx( par = c(1,2,1,2), fn = ToOptimiseVector, opt = c(NA, 2, NA, 2))
Calib2 = optimx( par = c(1,2,1,2), fn = ToOptimiseVector)
> Calib1
p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2 xtimes
Nelder-Mead 0.9998974 5.517528 3.00022 10.83214 4 103 NA NA 0 TRUE FALSE 0.02
BFGS 1.0000000 4.000000 3.00000 8.00000 4 6 3 NA 0 TRUE FALSE 0.00
> Calib2
p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2 xtimes
Nelder-Mead 1.000936 1.999793 3.0006 4.000256 1.344336e-06 227 NA NA 0 FALSE TRUE 0.01
BFGS 1.000000 2.000000 3.0000 4.000000 3.566556e-23 16 3 NA 0 TRUE TRUE 0.00
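Note that in Calib1 the p2 and p4 columns are free to drift, since the objective ignores them once they are fixed through opt. If you want the final parameter vector with the fixed values substituted back in, something along these lines should work (a small post-processing sketch, not part of the original approach):
opt  <- c(NA, 2, NA, 2)
est  <- unlist(Calib1["BFGS", paste0("p", 1:4)])  # optimised values reported by optimx
full <- ifelse(is.na(opt), est, opt)              # overwrite the fixed entries
full
## c(1, 2, 3, 2): p1 and p3 optimised, p2 and p4 taken from opt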
The trick is that the starting params and the arguments to the function to be optimized have to be aligned. Please see if the following helps you at all.
library(optimx)
ToOptimiseVector <- function(Theta){
  a <- Theta[1]
  b <- Theta[2]
  d <- Theta[3]
  e <- Theta[4]
  (a-1)^2 + (b-2)^2 + (d-3)^2 + (e-4)^2
}
start <- c(1,0,1,1)
start <- c(1,0,0,0)
start <- c(1,2,1,2)
Calib1 <- optimx(par = start,
                 fn = ToOptimiseVector)
This gives warning messages, but you do get a result.
> Calib1
p1 p2 p3 p4 value fevals gevals niter convcode kkt1 kkt2
Nelder-Mead 1 2 3 4 0 4 4 NA 0 TRUE TRUE
BFGS 1 2 3 4 0 4 4 NA 0 TRUE TRUE
xtimes
Nelder-Mead 0
BFGS 0
Hope you can start with this and proceed.

Nonlinear Least Squares in R - Levenberg Marquardt to Fit Heligman Pollard Model Parameters

I am attempting to reproduce the solutions of a paper by Kostakis. In this paper an abridged mortality table is expanded to a complete life table using the Heligman-Pollard model. The model has 8 parameters which have to be fitted. The author used a modified Gauss-Newton algorithm; this algorithm (E04FDF) is part of the NAG library of computer programs. Shouldn't Levenberg-Marquardt yield the same set of parameters? What is wrong with my code or my application of the LM algorithm?
library(minpack.lm)
## Heligman-Pollard is used to expand an abridged table.
## nonlinear least squares algorithm is used to fit the parameters on nqx observed over 5 year intervals (5qx)
AGE <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)
MORTALITY <- c(0.010384069, 0.001469140, 0.001309318, 0.003814265, 0.005378395, 0.005985625, 0.006741766, 0.009325056, 0.014149626, 0.021601755, 0.034271934, 0.053836246, 0.085287751, 0.136549522, 0.215953304)
## Start parameters for the Heligman-Pollard formula (converged set: a=0.0005893, b=0.0043836, c=0.0828424, d=0.000706, e=9.927863, f=22.197312, g=0.00004948, h=1.10003)
## I perturbed the parameter "a" in order to have starting values; the converged set is listed above.
parStart <- list(a=0.0008893,b=0.0043836,c=0.0828424,d=0.000706,e=9.927863,f=22.197312,g=0.00004948,h=1.10003)
## The Heligman-Pollard Formula (HP8) = qx/px = ...8 parameter equation
HP8 <-function(parS,x)
ifelse(x==0, parS$a^((x+parS$b)^parS$c) + parS$g*parS$h^x,
parS$a^((x+parS$b)^parS$c) + parS$d*exp(-parS$e*(log(x/parS$f))^2) +
parS$g*parS$h^x)
## Define qx = HP8/(1+HP8)
qxPred <- function(parS,x) HP8(parS,x)/(1+HP8(parS,x))
## Calculate nqx predicted by HP8 model (nqxPred(parStart,x))
nqxPred <- function(parS,x)
(1 -(1-qxPred(parS,x)) * (1-qxPred(parS,x+1)) *
(1-qxPred(parS,x+2)) * (1-qxPred(parS,x+3)) *
(1-qxPred(parS,x+4)))
##Define Residual Function, the relative squared distance is minimized
ResidFun <- function(parS, Observed,x) (nqxPred(parS,x)/Observed-1)^2
## Applying the nls.lm algo.
nls.out <- nls.lm(par=parStart, fn = ResidFun, Observed = MORTALITY, x = AGE,
control = nls.lm.control(nprint=1,
ftol = .Machine$double.eps,
ptol = .Machine$double.eps,
maxfev=10000, maxiter = 500))
summary(nls.out)
## The author used a modified Gauss-Newton algorithm; this algorithm (E04FDF) is part of the NAG library of computer programs.
## Shouldn't Levenberg-Marquardt yield the same set of parameters?
The bottom line here is that @Roland is absolutely right: this is a very ill-posed problem, and you shouldn't necessarily expect to get reliable answers. Below I have
- cleaned up the code in a few small ways (this is just aesthetic);
- changed the ResidFun to return residuals, not squared residuals (the former is correct, but this doesn't make very much difference);
- explored results from several different optimizers. It actually looks like the answer you're getting is better than the "converged parameters" you list above, which I'm assuming are the parameters from the original study (can you please provide a reference?).
Load package:
library(minpack.lm)
Data, as a data frame:
d <- data.frame(
AGE = seq(0,70,by=5),
MORTALITY=c(0.010384069, 0.001469140, 0.001309318, 0.003814265,
0.005378395, 0.005985625, 0.006741766, 0.009325056,
0.014149626, 0.021601755, 0.034271934, 0.053836246,
0.085287751, 0.136549522, 0.215953304))
First view of the data:
library(ggplot2)
(g1 <- ggplot(d,aes(AGE,MORTALITY))+geom_point())
g1+geom_smooth() ## with loess fit
Parameter choices:
Presumably these are the parameters from the original paper ...
parConv <- c(a=0.0005893,b=0.0043836,c=0.0828424,
d=0.000706,e=9.927863,f=22.197312,g=0.00004948,h=1.10003)
Perturbed parameters:
parStart <- parConv
parStart["a"] <- parStart["a"]+3e-4
The formulae:
HP8 <-function(parS,x)
with(as.list(parS),
ifelse(x==0, a^((x+b)^c) + g*h^x,
a^((x+b)^c) + d*exp(-e*(log(x/f))^2) + g*h^x))
## Define qx = HP8/(1+HP8)
qxPred <- function(parS,x) {
h <- HP8(parS,x)
h/(1+h)
}
## Calculate nqx predicted by HP8 model (nqxPred(parStart,x))
nqxPred <- function(parS,x)
(1 -(1-qxPred(parS,x)) * (1-qxPred(parS,x+1)) *
(1-qxPred(parS,x+2)) * (1-qxPred(parS,x+3)) *
(1-qxPred(parS,x+4)))
##Define Residual Function, the relative squared distance is minimized
ResidFun <- function(parS, Observed,x) (nqxPred(parS,x)/Observed-1)
n.b. this is changed slightly from the OP's version; nls.lm wants residuals, not squared residuals.
A sum-of-squares function for use with other optimizers:
ssqfun <- function(parS, Observed, x) {
sum(ResidFun(parS, Observed, x)^2)
}
Applying nls.lm. (Not sure why ftol and ptol were lowered
from sqrt(.Machine$double.eps) to .Machine$double.eps -- the
former is generally a practical limit to precision ...)
nls.out <- nls.lm(par=parStart, fn = ResidFun,
Observed = d$MORTALITY, x = d$AGE,
control = nls.lm.control(nprint=0,
ftol = .Machine$double.eps,
ptol = .Machine$double.eps,
maxfev=10000, maxiter = 1000))
parNLS <- coef(nls.out)
pred0 <- nqxPred(as.list(parConv),d$AGE)
pred1 <- nqxPred(as.list(parNLS),d$AGE)
dPred <- with(d,rbind(data.frame(AGE,MORTALITY=pred0,w="conv"),
data.frame(AGE,MORTALITY=pred1,w="nls")))
g1 + geom_line(data=dPred,aes(colour=w))
The lines are indistinguishable, but the parameters have some big
differences:
round(cbind(parNLS,parConv),5)
## parNLS parConv
## a 1.00000 0.00059
## b 50.46708 0.00438
## c 3.56799 0.08284
## d 0.00072 0.00071
## e 6.05200 9.92786
## f 21.82347 22.19731
## g 0.00005 0.00005
## h 1.10026 1.10003
d,f,g,h are close, but a,b,c are orders of magnitude different and e is 50% different.
Looking at the original equations, what's happening here is that a^((x+b)^c) is getting set to a constant, because a is approaching 1: once a is approximately 1, b and c are essentially irrelevant.
Let's check the correlation (we need a generalized inverse because
the matrix is so strongly correlated):
obj <- nls.out
vcov <- with(obj,deviance/(length(fvec) - length(par)) *
MASS::ginv(hessian))
cmat <- round(cov2cor(vcov),1)
dimnames(cmat) <- list(letters[1:8],letters[1:8])
## a b c d e f g h
## a 1.0 0.0 0.0 0.0 0.0 0.0 -0.1 0.0
## b 0.0 1.0 -1.0 1.0 -1.0 -1.0 -0.4 -1.0
## c 0.0 -1.0 1.0 -1.0 1.0 1.0 0.4 1.0
## d 0.0 1.0 -1.0 1.0 -1.0 -1.0 -0.4 -1.0
## e 0.0 -1.0 1.0 -1.0 1.0 1.0 0.4 1.0
## f 0.0 -1.0 1.0 -1.0 1.0 1.0 0.4 1.0
## g -0.1 -0.4 0.4 -0.4 0.4 0.4 1.0 0.4
## h 0.0 -1.0 1.0 -1.0 1.0 1.0 0.4 1.0
This is not actually so useful -- it really just confirms that lots
of the variables are strongly correlated ...
library(optimx)
mvec <- c('Nelder-Mead','BFGS','CG','L-BFGS-B',
'nlm','nlminb','spg','ucminf')
opt1 <- optimx(par=parStart, fn = ssqfun,
Observed = d$MORTALITY, x = d$AGE,
itnmax=5000,
method=mvec,control=list(kkt=TRUE))
## control=list(all.methods=TRUE,kkt=TRUE)) ## Boom!
## fvalues method fns grs itns conv KKT1 KKT2 xtimes
## 2 8.988466e+307 BFGS NA NULL NULL 9999 NA NA 0
## 3 8.988466e+307 CG NA NULL NULL 9999 NA NA 0
## 4 8.988466e+307 L-BFGS-B NA NULL NULL 9999 NA NA 0
## 5 8.988466e+307 nlm NA NA NA 9999 NA NA 0
## 7 0.3400858 spg 1 NA 1 3 NA NA 0.064
## 8 0.3400858 ucminf 1 1 NULL 0 NA NA 0.032
## 1 0.06099295 Nelder-Mead 501 NA NULL 1 NA NA 0.252
## 6 0.009275733 nlminb 200 1204 145 1 NA NA 0.708
This warns about bad scaling, and also finds a variety of different
answers: only ucminf claims to have converged, but nlminb gets a
better answer -- and the itnmax parameter seems to be ignored ...
opt2 <- nlminb(start=parStart, objective = ssqfun,
Observed = d$MORTALITY, x = d$AGE,
control= list(eval.max=5000,iter.max=5000))
parNLM <- opt2$par
Finishes, but with a false convergence warning ...
round(cbind(parNLS,parConv,parNLM),5)
## parNLS parConv parNLM
## a 1.00000 0.00059 1.00000
## b 50.46708 0.00438 55.37270
## c 3.56799 0.08284 3.89162
## d 0.00072 0.00071 0.00072
## e 6.05200 9.92786 6.04416
## f 21.82347 22.19731 21.82292
## g 0.00005 0.00005 0.00005
## h 1.10026 1.10003 1.10026
sapply(list(parNLS,parConv,parNLM),
ssqfun,Observed=d$MORTALITY,x=d$AGE)
## [1] 0.006346250 0.049972367 0.006315034
It looks like nlminb and minpack.lm are getting similar answers, and are actually doing better than the originally stated parameters (by quite a bit):
pred2 <- nqxPred(as.list(parNLM),d$AGE)
dPred <- with(d,rbind(dPred,
data.frame(AGE,MORTALITY=pred2,w="nlminb")))
g1 + geom_line(data=dPred,aes(colour=w))
ggsave("cmpplot.png")
ggplot(data=dPred,aes(x=AGE,y=MORTALITY-d$MORTALITY,colour=w))+
geom_line()+geom_point(aes(shape=w),alpha=0.3)
ggsave("residplot.png")
Other things one could try would be:
- appropriate scaling -- although a quick test of this doesn't seem to help that much
- providing analytical gradients
- using AD Model Builder
- using the slice function from bbmle to explore whether the old and new parameters seem to represent distinct minima, or whether the old parameters are just a false convergence ...
- getting the KKT (Karush-Kuhn-Tucker) criterion calculators from optimx or related packages working for similar checks
PS: the largest deviations (by far) are for the oldest age classes, which probably also have small samples. From a statistical point of view it would probably be worth doing a fit that weighted by the precision of the individual points ...
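For what it's worth, here is a minimal sketch of what such a precision-weighted fit could look like: nls.lm minimises the sum of squares of whatever residual vector it is given, so weighting just means scaling each residual by sqrt(w). The weights below are placeholders; real ones would have to come from the exposures (sample sizes) behind each observed 5qx.
## Sketch only: precision-weighting = scaling each relative residual by sqrt(w)
w <- rep(1, length(d$AGE))                     # placeholder precision weights
wResidFun <- function(parS, Observed, x, w)
  sqrt(w) * (nqxPred(parS, x) / Observed - 1)  # weighted relative residuals
nls.wout <- nls.lm(par = parStart, fn = wResidFun,
                   Observed = d$MORTALITY, x = d$AGE, w = w,
                   control = nls.lm.control(maxfev = 10000, maxiter = 1000))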
@BenBolker, fitting the parameters with the entire dataset (the underlying qx values), I am still not able to reproduce the parameters.
library(minpack.lm)
library(ggplot2)
library(optimx)
getwd()
d <- data.frame(AGE = seq(0,74), MORTALITY=c(869,58,40,37,36,35,32,28,29,23,24,22,24,28,
33,52,57,77,93,103,103,109,105,114,108,112,119,
125,117,127,125,134,134,131,152,179,173,182,199,
203,232,245,296,315,335,356,405,438,445,535,594,
623,693,749,816,915,994,1128,1172,1294,1473,
1544,1721,1967,2129,2331,2559,2901,3203,3470,
3782,4348,4714,5245,5646))
d$MORTALITY <- d$MORTALITY/100000
ggplot(d,aes(AGE,MORTALITY))+geom_point()
##Not allowed to post Images
g1 <- ggplot(d,aes(AGE,MORTALITY))+geom_point()
g1+geom_smooth()## with loess fit
Reported Parameters:
parConv <- c(a=0.0005893,b=0.0043836,c=0.0828424,d=0.000706,e=9.927863,f=22.197312,
g=0.00004948,h=1.10003)
parStart <- parConv
parStart["a"] <- parStart["a"]+3e-4
## Define qx = HP8/(1+HP8)
HP8 <-function(parS,x)
with(as.list(parS),
ifelse(x==0, a^((x+b)^c) + g*h^x, a^((x+b)^c) + d*exp(-e*(log(x/f))^2) + g*h^x))
qxPred <- function(parS,x) {
h <- HP8(parS,x)
h/(1+h)
}
##Define Residual Function, the relative squared distance is minimized,
ResidFun <- function(parS, Observed,x) (qxPred(parS,x)/Observed-1)
ssqfun <- function(parS, Observed, x) {
sum(ResidFun(parS, Observed, x)^2)
}
nls.out <- nls.lm(par=parStart, fn = ResidFun, Observed = d$MORTALITY, x = d$AGE,
control = nls.lm.control(nprint=1, ftol = sqrt(.Machine$double.eps),
ptol = sqrt(.Machine$double.eps), maxfev=1000, maxiter=1000))
parNLS <- coef(nls.out)
pred0 <- qxPred(as.list(parConv),d$AGE)
pred1 <- qxPred(as.list(parNLS),d$AGE)
#Binds Row wise the dataframes from pred0 and pred1
dPred <- with(d,rbind(data.frame(AGE,MORTALITY=pred0,w="conv"),
data.frame(AGE,MORTALITY=pred1,w="nls")))
g1 + geom_line(data=dPred,aes(colour=w))
round(cbind(parNLS,parConv),7)
mvec <- c('Nelder-Mead','BFGS','CG','L-BFGS-B','nlm','nlminb','spg','ucminf')
opt1 <- optimx(par=parStart, fn = ssqfun,
               Observed = d$MORTALITY, x = d$AGE,
               itnmax=5000,
               method=mvec, control=list(all.methods=TRUE, kkt=TRUE))
get.result(opt1, attribute= c("fvalues","method", "grs", "itns",
"conv", "KKT1", "KKT2", "xtimes"))
## method fvalues grs itns conv KKT1 KKT2 xtimes
##5 nlm 8.988466e+307 NA NA 9999 NA NA 0
##4 L-BFGS-B 8.988466e+307 NULL NULL 9999 NA NA 0
##2 CG 8.988466e+307 NULL NULL 9999 NA NA 0.02
##1 BFGS 8.988466e+307 NULL NULL 9999 NA NA 0
##3 Nelder-Mead 0.5673864 NA NULL 0 NA NA 0.42
##6 nlminb 0.4127198 546 62 0 NA NA 0.17
opt2 <- nlminb(start=parStart, objective = ssqfun,
Observed = d$MORTALITY, x = d$AGE,
control= list(eval.max=5000,iter.max=5000))
parNLM <- opt2$par
Check on parameters:
round(cbind(parNLS,parConv,parNLM),5)
## parNLS parConv parNLM
##a 0.00058 0.00059 0.00058
##b 0.00369 0.00438 0.00369
##c 0.08065 0.08284 0.08065
##d 0.00070 0.00071 0.00070
##e 9.30948 9.92786 9.30970
##f 22.30769 22.19731 22.30769
##g 0.00005 0.00005 0.00005
##h 1.10084 1.10003 1.10084
SSE Review:
sapply(list(parNLS,parConv,parNLM),
ssqfun,Observed=d$MORTALITY,x=d$AGE)
##[1] 0.4127198 0.4169513 0.4127198
I am not able to upload graphs, but the code is here. It still appears that the parameters found in the article are not the best fit when the complete mortality data (not abridged or a subset) are used.
## pred2 <- qxPred(as.list(parNLM), d$AGE)
## dPred <- with(d, rbind(dPred,
##                        data.frame(AGE, MORTALITY = pred2, w = "nlminb")))
## g1 + geom_line(data = dPred, aes(colour = w))
ggplot(data = dPred, aes(x = AGE, y = MORTALITY - d$MORTALITY, colour = w)) +
  geom_line() + geom_point(aes(shape = w), alpha = 0.3)

Getting frequency values from histogram in R

I know how to draw histograms and other frequency/percentage-related plots and tables.
What I want to know now is how to get those frequency values into a table to use after the fact.
I have a massive dataset; I draw a histogram with a set binwidth, and I want to extract the frequency value (i.e. the value on the y-axis) that corresponds to each bin and save it somewhere.
Can someone please help me with this?
Thank you!
The hist function has a return value (an object of class histogram):
R> res <- hist(rnorm(100))
R> res
$breaks
[1] -4 -3 -2 -1 0 1 2 3 4
$counts
[1] 1 2 17 27 34 16 2 1
$intensities
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$density
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$mids
[1] -3.5 -2.5 -1.5 -0.5 0.5 1.5 2.5 3.5
$xname
[1] "rnorm(100)"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
From ?hist:
Value
an object of class "histogram" which is a list with components:
breaks the n+1 cell boundaries (= breaks if that was a vector).
These are the nominal breaks, not with the boundary fuzz.
counts n integers; for each cell, the number of x[] inside.
density values f^(x[i]), as estimated density values. If
all(diff(breaks) == 1), they are the relative frequencies counts/n
and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i]
= breaks[i].
intensities same as density. Deprecated, but retained for
compatibility.
mids the n cell midpoints.
xname a character string with the actual x argument name.
equidist logical, indicating if the distances between breaks are all
the same.
breaks and density provide just about all you need:
histrv<-hist(x)
histrv$breaks
histrv$density
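If you want the frequencies as an actual table, one row per bin, you can assemble one from those components (a small sketch using the histrv object from above):
freq_table <- data.frame(
  lower   = head(histrv$breaks, -1),  # left edge of each bin
  upper   = tail(histrv$breaks, -1),  # right edge of each bin
  mid     = histrv$mids,
  count   = histrv$counts,            # the y-axis frequencies
  density = histrv$density
)
freq_table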
Just in case someone hits this question with ggplot's geom_histogram in mind, note that there is a way to extract the data from a ggplot object.
The following convenience function outputs a dataframe with the lower limit of each bin (xmin), the upper limit of each bin (xmax), the mid-point of each bin (x), as well as the frequency value (y).
## Convenience function
get_hist <- function(p) {
d <- ggplot_build(p)$data[[1]]
data.frame(x = d$x, xmin = d$xmin, xmax = d$xmax, y = d$y)
}
# make a dataframe for ggplot
set.seed(1)
x = runif(100, 0, 10)
y = cumsum(x)
df <- data.frame(x = sort(x), y = y)
# make geom_histogram
p <- ggplot(data = df, aes(x = x)) +
geom_histogram(aes(y = cumsum(..count..)), binwidth = 1, boundary = 0,
color = "black", fill = "white")
Illustration:
hist = get_hist(p)
head(hist$x)
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
head(hist$y)
## [1] 7 13 24 38 52 57
head(hist$xmax)
## [1] 1 2 3 4 5 6
head(hist$xmin)
## [1] 0 1 2 3 4 5
A related question I answered here (Cumulative histogram with ggplot2).
