nleqslv difficulties in R - solving for pH in an acid-base buffer

Goal
Build a theoretical titration curve for the phosphoric acid buffer (1M).
I provide a fully reproducible and self-contained example (of my failures ^.^).
Model equations
Acid-base equilibrium equations for phosphoric acid are:
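In terms of the symbols used in the code below, these are the phosphate mass balance, the charge balance (with [OH-] = Kw/[H+]), and the three stepwise dissociation equilibria:
H3A + H2A + HA + A = P_ca
H + Na = Kw/H + H2A + 2*HA + 3*A
Ka.1 = H*H2A/H3A,   Ka.2 = H*HA/H2A,   Ka.3 = H*A/HA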
Model implementation
Ka.1 <- 7.1 * 10^-3
Ka.2 <- 6.3 * 10^-8
Ka.3 <- 4.5 * 10^-13
Kw <- 10^-14
balance <- function(vars, Na_ca, P_ca, convert.fun = function(x) x){
  # Apply positive-only constraint
  vars <- convert.fun(vars)
  H   <- vars[1]
  H3A <- vars[2]
  H2A <- vars[3]
  HA  <- vars[4]
  A   <- vars[5]
  Na  <- convert.fun(Na_ca)
  eq.system <- c(H3A + H2A + HA + A - P_ca,
                 H + Na - Kw/H - H2A - 2*HA - 3*A,
                 H * H2A / Ka.1 - H3A,
                 H * HA  / Ka.2 - H2A,
                 H * A   / Ka.3 - HA
                 )
  return(eq.system)
}
Notice that convert.fun is there to try different ways of forcing the concentrations to be positive.
The return value is a vector of the model's equations, each equated to zero (is this right?).
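As a quick sanity check, the function can be evaluated at the starting guess using only the objects defined above; the solver's job is to drive this vector of residuals to zero:
# Residuals of the starting guess for [Na] = 0 and 1 M total phosphate.
balance(c(H = 1e-1, H3A = 0.9, H2A = 0.1, HA = 0.1, A = 0.1), Na_ca = 0, P_ca = 1)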
Iteration
I wished to solve the system for all possible Na+ concentrations, up to 3 equivalence "volumes".
I set initial conditions that worked for the lowest one: [Na] = 0.
Then solved it with nleqslv and used the result to "seed" the next iteration.
And it seemed to work nicely:
But, on close inspection, the issues become obvious.
Before getting to them, though, some code!
Setup initial conditions and results matrix:
P_ca <- 1
ci.start <- c(H=10^-1, H3A=0.9, H2A=0.1, HA=0.1, A=0.1)
Na.seq <- seq(from=0,to=3*P_ca,by=P_ca/1000)
varnames <- c("Na", "H", "H3A", "H2A", "HA", "A")
result.m <- matrix(ncol = length(varnames), nrow = length(Na.seq))
colnames(result.m) <- varnames
result.m[,1] <- Na.seq
Iteration:
convert.fun <- function(x) abs(x)
for(i in 1:length(Na.seq)){
  Na_ca <- result.m[i, 1]
  if(i == 1){                 # If this is the first iteration,
    ci <- ci.start            # use the "start" values as initial conditions.
  } else {                    # Otherwise,
    ci <- result.m[i-1, 2:6]  # use the previous solution.
  }
  result <- nleqslv::nleqslv(x = ci,
                             fn = balance,
                             Na_ca = Na_ca, P_ca = P_ca,
                             convert.fun = convert.fun,
                             method = "Newton",  # method = "Broyden",
                             global = "dbldog",
                             control = list(allowSingular = TRUE,
                                            maxit = 1000))
  result$x <- convert.fun(result$x)
  result.m[i, 2:6] <- result$x
  stopifnot(all(result$x >= 0))
}  # END LOOP
result.df <- as.data.frame(result.m)
Notice that convert.fun is now abs(x) (is this ok?).
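The plots shown below were built from result.df. The original plotting code is not included, but a minimal base-R sketch to draw the species curves (the axis and colour choices here are mine, not those of the original figures) could be:
# Plot the four phosphate species against the added Na+ concentration,
# on a log10 scale so the very small concentrations remain visible.
matplot(result.df$Na, log10(result.df[, c("H3A", "H2A", "HA", "A")]),
        type = "l", lty = 1,
        xlab = "[Na+] (M)", ylab = "log10 concentration (M)")
legend("bottomright", legend = c("H3A", "H2A", "HA", "A"), col = 1:4, lty = 1)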
The problem
The problem with the last plot is that the right part of it is flattened out.
The problem is even more obvious in the following plot:
The red curve is supposed to end up at the top, and the purple one at the bottom. This seems to start happening at Na ~ 2, but after a few more iterations the result flattens out (and becomes exactly constant).
Possible hints for the savvy
The problem is a bit worse using method="Broyden" instead of "Newton".
nleqslv's return message changes from "Function criterion near zero" to "x-values within tolerance 'xtol'".
I also tried adding a Jacobian. That didn't change the result, but at the problematic point I get something like this:
Chkjac possible error in jacobian[2,1] = 2.7598836276240e+06
Estimated[2,1] = 1.1104869955110e+04
I am now really out of ideas! And would really appreciate some help or guidance.

You should always test the termination code of nleqslv to determine whether a solution has been found, and somehow display the termination code and/or the message nleqslv returns. You will see that in some cases no better point was found; therefore any result is invalid and useless.
You are using so many values for Na.seq that it is impossible to see the wood for the trees.
I would suggest starting with a very limited set of values for Na.seq.
Something like
Na.seq <- seq(from=0,to=3*P_ca,by=P_ca/10)
and also this to include the termination code in the result
varnames <- c("Na", "H", "H3A", "H2A", "HA", "A", "termcd")
result.m <- matrix(ncol = length(varnames), nrow = length(Na.seq))
And then change the iteration loop to this
for(i in 1:length(Na.seq)){
  Na_ca <- result.m[i, 1]
  if(i == 1){                 # If this is the first iteration,
    ci <- ci.start            # use the "start" values as initial conditions.
  } else {                    # Otherwise,
    ci <- result.m[i-1, 2:6]  # use the previous solution.
  }
  iter.trace <- 1
  cat("Iteration ", i, "\n\n")
  result <- nleqslv::nleqslv(x = ci,
                             fn = balance,
                             Na_ca = Na_ca, P_ca = P_ca,
                             convert.fun = convert.fun,
                             method = "Newton",  # method = "Broyden",
                             global = "dbldog",
                             control = list(allowSingular = TRUE,
                                            maxit = 1000, trace = iter.trace))
  cat("\n\n ", result$message, "\n\n")
  result$x <- convert.fun(result$x)
  result.m[i, 2:6] <- result$x
  result.m[i, 7] <- result$termcd
  stopifnot(all(result$x >= 0))
}  # END LOOP
and start analysing the output to find out what the problem is and where.
Addendum
I am reasonably sure that the difficulties in solving the system are (at least partly) numerical. With the above modifications I changed the values of Ka.1, Ka.2, Ka.3, and Kw to
Ka.1 <- 7.1 * 10^-1
Ka.2 <- 6.3 * 10^-3
Ka.3 <- 4.5 * 10^-3
Kw <- 10^-3
and then there are no problems in finding a solution (all termination codes are 1). I suspect that the very small values for the K... constants are the cause of the problem. Check the system for possible errors or try to change the measurement units of the variables.
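A related reparameterisation, sketched here under the assumption that the rest of the setup stays the same (it is not literally a change of units, but serves the same purpose of improving the scaling), is to solve for the base-10 logarithms of the concentrations. This enforces positivity without abs() and puts the unknowns on comparable scales; balance.log and ci.log are illustrative names of mine, not part of the original code.
# Reparameterised system: the unknowns are log10 concentrations.
balance.log <- function(logvars, Na_ca, P_ca){
  v   <- 10^logvars                      # back-transform to concentrations
  H   <- v[1]; H3A <- v[2]; H2A <- v[3]; HA <- v[4]; A <- v[5]
  c(H3A + H2A + HA + A - P_ca,           # mass balance
    H + Na_ca - Kw/H - H2A - 2*HA - 3*A, # charge balance
    H * H2A / Ka.1 - H3A,
    H * HA  / Ka.2 - H2A,
    H * A   / Ka.3 - HA)
}
ci.log <- log10(c(H = 1e-1, H3A = 0.9, H2A = 0.1, HA = 0.1, A = 0.1))
res <- nleqslv::nleqslv(ci.log, balance.log, Na_ca = 0, P_ca = 1,
                        method = "Newton", global = "dbldog",
                        control = list(maxit = 1000))
10^res$x   # back-transform the solver output to concentrations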

Solution details
Find details and full code at this repo.
The numerical method worked, and the analytical answer provided on Chemistry Stack Exchange happily coincides :)
Sadly it does not match the experimental data from Julia Martín et al. (DOI 10.20431/2349-0403.0409002). Perhaps I'll post a question about it on Chemistry Stack Exchange.
My thanks to everyone who helped out <3
Lastly, important plots from the numerical simulation:

Related

Moment Matching Scenario Generation in R

I am working on a portfolio optimization algorithm, and part of the problem consists in generating moment-matching scenarios.
My choice, due to its simplicity and speed, was to follow the paper "An algorithm for moment-matching scenario generation with application to financial portfolio optimization" (Ponomareva, Roman and Date).
The problem is that, even though the mathematics is very simple, I am stuck on the fact that some of the probability weights p_i come out negative, even though the formulas in the paper should ensure otherwise. If I put the algorithm in a loop until it finds a positive combination, it essentially runs forever.
Here is the bit of code, based on the paper, where things get stuck:
dummy1 = 0
while (dummy1 <= 0 | dummy1 >= 1) {
  dummy1 = round(rnorm(1, mean = 0.5, sd = 0.25), 2)
}
diag.cov.returns = diag(cov.returns)
Z = dummy1 * sqrt(diag.cov.returns)  # Vector Z according to the paper's formula
ZZT = Z %*% t(Z)
LLT = cov.returns - ZZT
L = chol(LLT)  # Cholesky decomposition to get matrix L
s = sample(1:5, 1)
F1 = 0
F2 = -1
S = (2*N*s) + 3
while (((4*F2) - (3*F1*F1)) < 0) {
  #Gamma = (2*s*s)*(((N*mean.fourth) - (0.75*(sum(Z^4)* (N*mean.third/sum(Z^3))^2)))/sum(L^4))
  #Gamma is necessary if we want to get p from a Uniform distribution
  #U = runif(s, 0, 1)
  U = rgamma(s, shape = 1, scale = ((1/exp(1)):1))
  #p = (s*(N/Gamma)) + ((1/(2*N*s)) - (s/(N*Gamma)))*U
  p = (-log(U, base = exp(1)))
  p = p/(((2*sum(p)) + max(p))*N*s)  # this is the array expected to be positive and bounded between 0 and 1
  q1 = 1/p
  pz = p
  p[s+1] = (1 - (2*N*sum(p)))  # extra point necessary to get the 3 moment-matching probabilities
  F1 = (N*mean.third*sqrt(p[s+1]))/(sum(Z^3))
  F2 = p[s+1]*(((N*mean.fourth) - (1/(2*s*s))*sum(L^4)*(sum(1/p)))/sum(Z^4))
}
alpha = (0.5*F1) + 0.5*sqrt((4*F2) - (3*F1*F1))
beta = -(0.5*F1) + 0.5*sqrt((4*F2) - (3*F1*F1))
w1 = 1/(alpha*(alpha + beta))
w2 = 1/(beta*(alpha + beta))
w0 = 1 - (1/(alpha*beta))
P = rep(pz, 2*N)  # Vector of probabilities starting from p, plus 3 extra probabilities to match the third and fourth moments
P[(2*N*s) + 1] = p[s+1]*w0
P[(2*N*s) + 2] = p[s+1]*w1
P[(2*N*s) + 3] = p[s+1]*w2
Unfortunately I cannot disclose the input dataset containing the fund returns, but I can be more specific. The starting point is a data.frame() containing N assets' returns (in my case 11 funds with monthly returns from 30/01/2001 to 30/09/2020), from which the mean returns, covariance matrix, and central third and fourth moments (NOT skewness and kurtosis) are computed. The algorithm then follows as reported above. The point where I get stuck is that p also takes negative values, which is a problem since the first s elements of p are later used as probabilities in P.
I hope the problem is clearer this way. I also want to add that the data used by the authors is reported in the paper, but unfortunately it would have to be imported into R manually. In any case, any data.frame() containing assets' returns will do.
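To make the sign problem concrete, here is a small check of mine using the same Gamma(shape = 1, scale = 1/e) draw as the snippet above: p is built as -log(U), so any draw with U > 1 yields a negative entry before the normalisation step, and dividing every entry by a common constant cannot turn mixed signs into all-positive values.
# Fraction of raw draws that would produce a negative p entry.
U <- rgamma(10000, shape = 1, scale = 1/exp(1))
mean(-log(U) < 0)   # roughly exp(-e), i.e. about 7% of entries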

How to formulate "variable free in sign" in Linear programming?

I just tried to formulate DEA (data envelopment analysis) in an R script, and the LP has the following format:
Snapshot of the model
Since I don't know how to set a "free-in-sign variable" in the package "Rglpk" (which I used to solve the LP in R), I set up two variables each to represent u1 and u2 in the model:
u1 = u1a - u1b ; u2 = u2a - u2b
And my r code to solve this model is shown below:
f.rhs <- c(rep(0, 1, 2*N), 1)
f.dir <- c(rep("<=", 1, 2*N), "==")
aux1 <- cbind(-1*X, Z, 0*Y, 1, -1, 0, 0)
aux2 <- cbind(0*X, -1*Z, Y, 0, 0, 1, -1)
for (i in 1:N) {
  f.obj <- c(rep(0, 1, s), as.numeric(Z[i,]), as.numeric(Y[i,]), 1, -1, 1, -1)
  f.con <- rbind(aux1, aux2, c(as.numeric(X[i,]), as.numeric(Z[i,]), rep(0, 1, m+4)))
  results <- Rglpk_solve_LP(f.obj, f.con, f.dir, f.rhs, max = TRUE)
}
But I didn't get any acceptable results and don't know if there is a problem in my formulation. The outcomes of this model should be in the range 0 to 1, but I got a 1.033 and a negative weight in one of the outcomes.
Could somebody please guide me on how to solve this problem? Thank you!
The docs of Rglpk show that there is a bounds parameter available.
Use it! (And don't split.)
While the split-variable approach should work in general (I did not check your code), it's a bad approach: it doubles the number of variables and can cause numerical instability, at least for IPM-based solvers; more direct formulations are preferred.
As GLPK is simplex-based (I think there is a not-state-of-the-art IPM too, but simplex is probably the default), there is a high probability that these bounds are handled more directly, so use that (via the bounds argument).
Defining a variable as free in this case means: set its bounds to -Inf, Inf.
Something straight from the docs (page 6 of 7; version 0.6-2):
## Same as before but with bounds replaced by
## -Inf < x_1 <= 4
## 0 <= x_2 <= 100
## 2 <= x_3 < Inf
bounds <- list(lower = list(ind = c(1L, 3L), val = c(-Inf, 2)),
upper = list(ind = c(1L, 2L), val = c(4, 100)))
Rglpk_solve_LP(obj, mat, dir, rhs, bounds, types, max)
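Applied to your model, that would look roughly like the sketch below; k1 and k2 are hypothetical positions of u1 and u2 in your decision vector, so adjust them to your actual column layout.
library(Rglpk)
k1 <- 5; k2 <- 6   # hypothetical positions of the free variables u1 and u2
bounds <- list(lower = list(ind = c(k1, k2), val = c(-Inf, -Inf)))
# All other variables keep Rglpk's default bounds [0, Inf).
# results <- Rglpk_solve_LP(f.obj, f.con, f.dir, f.rhs, bounds = bounds, max = TRUE)
Note that once the free variables are declared this way, the duplicated +/- columns for u1 and u2 in f.obj and f.con are no longer needed.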

Implementing a neural network from scratch in R

I'm working on an assignment for my Machine Learning course, and as part of it I'm trying to implement a neural network. Since it's for school, I have to implement the algorithm manually, and not use any of the neuralnet packages available.
I've been using the material in "Learning from Data" along with the Caltech lectures that follow it on YouTube.
I've put together the algorithm in R to the best of my ability, but something is going wrong along the way. I haven't been able to implement a stopping criterion based on the change in the cost function, so for now I've just fixed the number of iterations as a constant.
Edit
Hey guys, thanks for the responses. I can see I'm missing a lot of needed information. Sorry about that; I don't really know what I was thinking.
The data I'm using is simply "toy data" generated from the sinc function sinc(x)=sin(x)/x.
The problem I'm having specifically is that the estimates that I get at the end of the algorithm are completely off from the real values, and they are significantly different every time I run the algorithm. It seems like I've put the algorithm together the way the book states, but I can't see where the problem is.
Edit 2
Added the data to the code so it can be run without doing anything extra. I also separated the individual parts of the function. As I mentioned in a comment, I was able to numerically verify the partial derivatives, so I think that part is OK. The problem I have is when I need to update the weights in order to train the network.
It's not in this part of the code, but I thought that in order to update the weights, you simply took the old weight and subtracted the partial derivative of that weight scaled by the learning rate? (wNew = wOld - eta*djdwOld)
theta <- function(a){
  a / (1 + abs(a))    # Sigmoid-shaped (softsign) non-linearity.
}
theta.prime <- function(a){
  1 / (1 + abs(a))^2  # Its derivative.
}
x <- c( 5.949110, -1.036600, 3.256780, 7.824520, -3.606010, 3.115640, -7.786960,
-7.598090, 2.083880, 3.983000, 8.060120, 7.879760, -2.456670,
-2.152720, 3.471950, 3.567960, -4.232630, 6.831610, -9.486860, 8.692330,
-1.551860, 0.917305, 4.669480, -7.760430, 2.835410)
y <- c(-0.10804400, 0.78264000, -0.05313330, 0.13484700, -0.05522470, -0.05758530,
0.19566100, 0.13846000, 0.43534100, -0.16861400, 0.10625000,
0.08427310, 0.27012900, 0.44004800, -0.00880575, -0.10711400, -0.18671100,
0.01158470, 0.02767190, 0.06319830, 0.61802000, 0.87124300,
-0.25668100, 0.06160800, 0.10575700)
inputlayer <- 1
outputlayer <- 1
hiddenlayer <- 2
w1 <- t(matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,inputlayer))
w2 <- matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,outputlayer)
### Forwardprop ###
forward <- function(x, w1, w2, theta){
  s2 <- x %*% w1
  a2 <- apply(s2, c(1,2), theta)
  s3 <- a2 %*% w2
  yhat <- apply(s3, c(1,2), theta)
  return(yhat)
}
### Forward propagation manually ###
s2 <- x%*%w1
a2 <- apply(s2,c(1,2),theta)
s3 <- a2%*%w2
yhat <- apply(s3,c(1,2),theta)
### Error function ###
#yhat <- forward(x,w1,w2,theta)
E <- sum((y-yhat)^2)/(length(x))
### Backward Propagation ###
delta3 <- (-2*(y-yhat)) * apply(s3,c(1,2),theta.prime)
djdw2 <- t(a2) %*% delta3
delta2 <- delta3 %*% t(w2) * apply(s2,c(1,2),theta.prime)
djdw1 <- t(x)%*%delta2
### Numerically estimated gradients ###
e <- 1e-8
numgrad1 <- matrix(0,1,2)
eps <- matrix(0,1,2)
w1e <- matrix(0,1,2)
for(j in 1:2) {
  eps[1,j] <- e
  w1e <- w1 + eps
  loss2 <- sum((y - forward(x, w1e, w2, theta))^2)
  w1e <- w1
  loss1 <- sum((y - forward(x, w1e, w2, theta))^2)
  numgrad1[1,j] <- (loss2 - loss1)/e
  eps[1,j] <- 0
}
numgrad2 <- matrix(0,2,1)
eps <- matrix(0,2,1)
w2e <- matrix(0,2,1)
for(j in 1:2) {
  eps[j,1] <- e
  w2e <- w2 + eps
  loss2 <- sum((y - forward(x, w1, w2e, theta))^2)
  w2e <- w2
  loss1 <- sum((y - forward(x, w1, w2e, theta))^2)
  numgrad2[j,1] <- (loss2 - loss1)/e
  eps[j,1] <- 0
}
# Comparison of our gradients from backpropagation
# and numerical estimation.
c(djdw1,djdw2)
c(numgrad1,numgrad2)
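A minimal sketch of the update step asked about above (wNew = wOld - eta*djdwOld), reusing the pieces already defined; the learning rate and iteration count below are arbitrary placeholders of mine, not values from the book.
eta   <- 0.01
iters <- 1000
for(it in 1:iters){
  # Forward pass
  s2   <- x %*% w1
  a2   <- apply(s2, c(1,2), theta)
  s3   <- a2 %*% w2
  yhat <- apply(s3, c(1,2), theta)
  # Backward pass (same expressions as above)
  delta3 <- (-2*(y - yhat)) * apply(s3, c(1,2), theta.prime)
  djdw2  <- t(a2) %*% delta3
  delta2 <- delta3 %*% t(w2) * apply(s2, c(1,2), theta.prime)
  djdw1  <- t(x) %*% delta2
  # Plain gradient-descent update
  w1 <- w1 - eta * djdw1
  w2 <- w2 - eta * djdw2
}
sum((y - forward(x, w1, w2, theta))^2) / length(x)   # training error after the loop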

How does ar.yw estimate the variance

In R, how does the function ar.yw estimate the variance? Specifically, where does the number "var.pred" come from? It does not seem to come from the usual YW estimate of the variance, nor the sum of squared residuals divided by df (even though there is disagreement about what the df should be, none of the choices give an answer equivalent to var.pred). And yes, I know that there are better methods than YW; just trying to figure out what R is doing.
set.seed(82346)
temp <- arima.sim(n=10, list(ar = 0.5), sd=1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic=FALSE, order.max=1)
## R's estimate of the sigma squared
fit$var.pred
## YW estimate
sum(temp^2)/10 - fit$ar*sum(temp[2:10]*temp[1:9])/10
## YW if there was a mean
sum((temp-mean(temp))^2)/10 - fit$ar*sum((temp[2:10]-mean(temp))*(temp[1:9]-mean(temp)))/10
## estimate based on residuals, different possible df.
sum(na.omit(fit$resid^2))/10
sum(na.omit(fit$resid^2))/9
sum(na.omit(fit$resid^2))/8
sum(na.omit(fit$resid^2))/7
Need to read the code if it's not documented.
?ar.yw
Which says: "In ar.yw the variance matrix of the innovations is computed from the fitted coefficients and the autocovariance of x." If that is not enough explanation, then you need to look at the code:
methods(ar.yw)
#[1] ar.yw.default* ar.yw.mts*
#see '?methods' for accessing help and source code
getAnywhere(ar.yw.default)
# there are two cases that I see
x <- as.matrix(x)
nser <- ncol(x)
if (nser > 1L) # .... not your situation
  #....
else {
  r <- as.double(drop(xacf))
  z <- .Fortran(C_eureka, as.integer(order.max), r, r,
                coefs = double(order.max^2), vars = double(order.max),
                double(order.max))
  coefs <- matrix(z$coefs, order.max, order.max)
  partialacf <- array(diag(coefs), dim = c(order.max, 1L, 1L))
  var.pred <- c(r[1L], z$vars)
  #.......
  order <- if (aic)
    (0L:order.max)[xaic == 0L]
  else order.max
  ar <- if (order)
    coefs[order, seq_len(order)]
  else numeric()
  var.pred <- var.pred[order + 1L]
  var.pred <- var.pred * n.used/(n.used - (order + 1L))
So you now need to find the Fortran code for C_eureka. I think I'm finding it here: https://svn.r-project.org/R/trunk/src/library/stats/src/eureka.f This is the code that I think is returning the var.pred estimate. I'm not a time series guy, and it's your responsibility to review this process for applicability to your problem.
subroutine eureka (lr,r,g,f,var,a)
c
c solves Toeplitz matrix equation toep(r)f=g(1+.)
c by Levinson's algorithm
c a is a workspace of size lr, the number
c of equations
c
snipped
c estimate the innovations variance
var(l) = var(l-1) * (1 - f(l,l)*f(l,l))
if (l .eq. lr) return
d = 0.0d0
q = 0.0d0
do 50 i = 1, l
k = l-i+2
d = d + a(i)*r(k)
q = q + f(l,i)*r(k)
50 continue
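If I am reading eureka and the scaling line above correctly, then for this AR(1), demean = FALSE example var.pred is just the questioner's Yule-Walker estimate rescaled by n.used/(n.used - (order + 1)). A quick check of mine (not from the original answer):
n  <- 10
yw <- sum(temp^2)/n - fit$ar * sum(temp[2:n] * temp[1:(n-1)])/n  # c0 * (1 - phi^2)
yw * n / (n - 2)   # should reproduce fit$var.pred
fit$var.pred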

How can I get the median of all measurements within a certain distance?

There's probably a better way, but since I'm new to R and already had the IDW code set up, I've been trying to get the median of all points within 2000 meters by tweaking the IDW code, setting the weighting power (idp) near zero so closer points are weighted the same as far ones.
I'm guessing it says NA when I run the code below with maxdist=2000 because some points don't have any neighbors within 2000 meters. The smallest maxdist I can get it to work with is ~40,000, even if I set nmin to zero.
Is there a way to tell it to ignore points without neighbors within 2000 meters, or does someone know a better way to do this?
Here's my code:
library(gstat)
clean3145 = read.csv("clean3145.csv")
#Set up the k-fold validation
set.seed(88)
groups <- sample(1:5, nrow(clean3145), replace=TRUE)
#res=result=R=Pearson's correlation between predicted and actual arsenic concentration
MEDres<- rep(NA, 5)
r <- list()
for (k in 1:5) {
  print(k)
  flush.console()
  train <- clean3145[groups != k, ]
  test <- clean3145[groups == k, ]
  med <- gstat(formula = As1~1, locations = ~UTMNM+UTMEM, data = train,
               nmin = 0, maxdist = 40000, set = list(idp = .01))
  medpred <- predict(med, test)$var1.pred
  MEDres[k] <- cor(test$As1, medpred)
}
#Show the mean correlation for the 5 different training-test dataset pairs in K-fold validation
mean(MEDres)
Thanks for your help!
I can't see how your code helps answer your original question, but for a local median I would try:
library(sp)
demo(meuse, ask = FALSE)
library(gstat)
x = krige(zinc~1, meuse, meuse.grid, maxdist = 1000, set = list(method = "med"))
If a neighbourhood contains no data, you may define it by the number of nearest points, nmax, in which case, of course, distance is no longer controlled.
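For example, a sketch following that note (nmax = 10 is an arbitrary choice):
# Local median over the 10 nearest observations rather than a fixed radius.
x2 = krige(zinc ~ 1, meuse, meuse.grid, nmax = 10, set = list(method = "med"))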
Thanks Edzer!
I'll save that for future reference. We got it to work this way, with depth criteria too (I'm trying to estimate arsenic in groundwater):
#Load required packages and data
library(raster)
depth = read.csv("depth.csv")
Set up the k-fold validation, making sure the same random sample is chosen each time for comparability
set.seed(88)
groups <- sample(1:5, nrow(depth), replace=TRUE)
Compute median arsenic concentration of all training (trn) wells within a certain point distance (pd) of test (tst) wells, using the UTM east and north coordinates in meters (UTMEM, UTMNM). Ignore or "remove" test wells that do not have neighbors within 148 meters (pd>148=NA, na.rm=TRUE)
computeMed <- function(trn, tst) {
  pd <- pointDistance(trn[, c('UTMEM', 'UTMNM')], tst[, c('UTMEM', 'UTMNM')], lonlat = FALSE)
  pd[pd > 148] <- NA
  as <- trn$As1
  as <- matrix(rep(as, ncol(pd)), ncol = ncol(pd))
  aspd <- as * (pd >= 0)
  apply(aspd, 2, median, na.rm = TRUE)
}
Compute medians again, this time with depth criteria (e.g., if a test well is near Fallon (Tcan2car=1 = wells from the Truckee Canal to the Carson Basin and downgradient) and more than 40 m deep, only give the median of neighbors that are also >40 m deep):
r <- rd <- list()
Fallon <- FALSE
for (k in 1:5) {
  print(k)
  flush.console()
  depth$deep <- TRUE
  depth$deep[depth$Depth_m < 40] <- FALSE
  if (Fallon) {
    d <- depth[depth$Tcan2car == 1, ]
  } else {
    d <- depth
  }
  train <- d[groups != k, ]
  test <- d[groups == k, ]
  p <- computeMed(train, test)
  r[[k]] <- cbind(k = k, prd = p, obs = test$As1)
  pdeep <- computeMed(train[train$deep, ], test[test$deep, ])
  pshallow <- computeMed(train[!train$deep, ], test[!test$deep, ])
  rd[[k]] <- cbind(k = k, prd = c(pdeep, pshallow), obs = test$As1[c(which(test$deep), which(!test$deep))])
}
Show the mean Pearson's R correlation for the 5 different training-test dataset pairs in k-fold validation. cr and r refer to correlations based on distance only; crd and rd also include the depth criteria.
cr <- sapply(r, function(x) {x <- na.omit(x); cor(x[,2:3])[2]})
cr
mean(cr)
crd <- sapply(rd, function(x) {x <- na.omit(x); cor(x[,2:3])[2]})
crd
mean(crd)
