error in embedded conditional expression - r

I have an outer conditional expression: when the result of a function (the last element of VitPath$states) satisfies its condition (== 7), G should be the sum of several embedded conditional terms, each of which contributes its product only when its own condition (< 0.1) is satisfied.
G<-if(tail(VitPath$states,1)==7)
{if(summary(ResFit)$coef[370]<0.1) summary(ResFit)$coef[64]*summary(ResFit)$coef[91]
else 0 +
if(summary(ResFit)$coef[371]<0.1) summary(ResFit)$coef[65]*summary(ResFit)$coef[93]
else 0 +
if(summary(ResFit)$coef[372]<0.1) summary(ResFit)$coef[66]*summary(ResFit)$coef[95]
else 0 +
if(summary(ResFit)$coef[373]<0.1) summary(ResFit)$coef[67]*summary(ResFit)$coef[97]
else 0 +
if(summary(ResFit)$coef[374]<0.1) summary(ResFit)$coef[68]*summary(ResFit)$coef[99]
else 0 +
if(summary(ResFit)$coef[375]<0.1) summary(ResFit)$coef[69]*summary(ResFit)$coef[101]
else 0 +
if(summary(ResFit)$coef[376]<0.1) summary(ResFit)$coef[70]*summary(ResFit)$coef[103]
else 0 +
if(summary(ResFit)$coef[377]<0.1) summary(ResFit)$coef[71]*summary(ResFit)$coef[105]
else 0 +
if(summary(ResFit)$coef[378]<0.1) summary(ResFit)$coef[72]*summary(ResFit)$coef[107]
else 0} else 0
It seems that the embedded conditional expressions are not properly expressed, as the first embedded expression produces an error and does not seem to understand the multiplication:
Computing the asymptotic covariance matrix of estimates
Error in if (summary(ResFit)$coef[370] < 0.1) summary(ResFit)$coef[64] * :
  missing value where TRUE/FALSE is needed
Is there a way to do this without changing the present format? Placing parentheses or brackets around a product, e.g. (summary(ResFit)$coef[71]*summary(ResFit)$coef[105]), does not work either.
Many thanks
EDITED
ResFit is produced by the following expression, X being a vector of numeric values:
ResFit = HMMFit(X, nStates=9, control=list(init="KMEANS"))

Try this. It's a Boolean algebra version: when a test fails, the corresponding term is 0, and when it succeeds it is the product. I assume you only want the first column of the summary coef matrix, since multiplying the standard errors would not make much sense:
G <- if(tail(VitPath$states,1)==7) {
       sum( ( summary(ResFit)$coef[370:378,1] < 0.1 ) *
            ( summary(ResFit)$coef[64:72,1] ) *
            ( summary(ResFit)$coef[seq(91, 107, by=2),1] ) ) }
This should be much faster, as well as clearer and easier to maintain. If you are using a modeling function other than lm, you should post further details. I was assuming the structure returned by the $coef call would be like:
coefficients
a p x 4 matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value. Aliased coefficients are omitted.
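To see why the Boolean algebra trick works, here is a tiny self-contained sketch with made-up numbers (the vectors are hypothetical, not taken from ResFit): logical values coerce to 0/1 under multiplication, so failed tests simply drop out of the sum.
pvals <- c(0.05, 0.20, 0.01)   # hypothetical p-values
a     <- c(2, 3, 4)            # hypothetical coefficients
b     <- c(10, 10, 10)
sum((pvals < 0.1) * a * b)     # 2*10 + 4*10 = 60; the 0.20 term contributes 0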

Constrained Optimization in R using constrOptim.nl

I am trying to find the maximum values of this objective function:
f(x1,x2,x3) = 1300x1 + 600x2 + 500x3
subject to the following constraints
300x1 + 150x2 + 100x3 <= 4,000
90x1 + 30x2 + 40x3 <= 1,000
x1 <= 5
x1, x2, x3 >= 0
Below is the code I am using, which is not returning the values I'm looking for. The outputs for the variables are 9.453022e-12, 3.272252e-12, 5.548419e-14 and the total value is -1.428002e-08.
I'm new to R. What am I doing wrong? Thank you.
library(alabama)   # provides constrOptim.nl
f <- function(x) -(1300*x[1] + 600*x[2] + 500*x[3])   # minimize -f(x1,x2,x3)
inequalities <- function(x){   # define the inequality constraints
  h <- numeric(3)
  h[1] <- -(300*x[1] + 150*x[2] + 100*x[3] - 4000)
  h[2] <- -( 90*x[1] +  30*x[2] +  40*x[3] - 1000)
  h[3] <- -(  1*x[1] +   0*x[2] +   0*x[3] -    5)
  return(h)
}
g <- function(x){   # x1, x2, x3 > 0
  h <- numeric(3)
  h[1] <- x[1]
  h[2] <- x[2]
  h[3] <- x[3]
  return(h)
}
p0 <- c(0, 0, 0)   # the starting point
y <- constrOptim.nl(p0, f, hin=inequalities, heq=g)
print(y$par)
print(y$value)
The documentation says:
heq: a vector function specifying equality constraints such that heq[j] = 0 for all j
So the lower bounds x[1], x[2], x[3] >= 0 you are trying to specify are actually interpreted as the equalities x[1] = x[2] = x[3] = 0. Hence the solution 9.453022e-12, 3.272252e-12, 5.548419e-14. Your lower bounds need to be incorporated in hin.
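A sketch of the repaired call, assuming constrOptim.nl comes from the alabama package: all six inequalities go into hin, heq is dropped, and the starting point is moved strictly inside the feasible region so that hin(p0) > 0.
library(alabama)
f <- function(x) -(1300*x[1] + 600*x[2] + 500*x[3])
inequalities <- function(x){
  c(-(300*x[1] + 150*x[2] + 100*x[3] - 4000),
    -( 90*x[1] +  30*x[2] +  40*x[3] - 1000),
    -(x[1] - 5),
    x[1], x[2], x[3])   # the lower bounds, now as inequalities
}
p0 <- c(1, 1, 1)   # strictly feasible start
y <- constrOptim.nl(p0, f, hin = inequalities)
y$par      # approximate maximizer
-y$value   # maximized objective value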
Note that there are better, dedicated linear solvers for linear problems; passing a linear problem to a non-linear solver is suboptimal.
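For example, a sketch of the same LP using the lpSolve package (my choice of solver, any LP solver will do); lp() treats all variables as non-negative by default, so only the three inequality rows are needed.
library(lpSolve)
obj <- c(1300, 600, 500)
A   <- rbind(c(300, 150, 100),
             c( 90,  30,  40),
             c(  1,   0,   0))
sol <- lp("max", obj, A, rep("<=", 3), c(4000, 1000, 5))
sol$solution   # optimal x1, x2, x3
sol$objval     # maximum of the objective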

R Tricks for dealing with 0/0 and NaNs

I have two Gaussian functions; I divide values of one by values of the other and then use the resulting ratios in if-else statements and integration. As both functions go to 0 fairly quickly, I eventually get 0/0, which R returns as NaN, and this produces errors later in the code. Aside from getting greater precision from, say, Rmpfr, how might one deal with expressions that result in 0/0 or NaN?
Edit: Here is the code I'm using for greater clarity
parameters <- c(1, 1, 2)
R <- function(params, z.){
  params[2]*exp(-z.^2/(2*params[3]^2))
}
alpha <- function(params, z., v.){
  1/v.[2]*exp(-(z.-v.[1])^2/(2*v.[2]))
}
total.rel.alpha <- function(params, z., u.){
  final <- 0
  species.number <- dim(u.)[1]
  if(is.null(dim(u.))){
    return(alpha(params, z., u.)/R(params, z.))
  }
  for(i in 1:species.number){
    final <- final + alpha(params, z., u.[i,])
  }
  return(final/R(params, z.))
}
Kz <- function(params, z., v., u.){
  if(total.rel.alpha(params, z., u.) >= 0 & total.rel.alpha(params, z., u.) <= 1){
    alpha(params, z., v.)
  } else if(total.rel.alpha(params, z., u.) > 1){
    alpha(params, z., v.)/total.rel.alpha(params, z., u.)
  }
}
K <- function(params, v., u.){
  integrate(Kz, -Inf, Inf, params=params, v.=v., u.=u.)$value
}
If I run:
K(parameters, c(1,1), c(1,1))
I get
Error in integrate(Kz, -Inf, Inf, params = params, v. = v., u. = u.) : non-finite function value
I think this is because at the tails I get 0/0 in total.rel.alpha
Work with log(p1/p2) where p1 and p2 are the Gaussian densities in question. You will get a difference of quadratic terms plus some other stuff.
Taking logarithms is a standard approach for working with very small but nonzero probability values.
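A minimal sketch of that idea using the question's own R() and alpha() for the single-species case (the names log.rel.alpha and rel.alpha are mine): compute the log of the ratio directly, so the tails underflow cleanly to 0 instead of evaluating 0/0.
log.rel.alpha <- function(params, z., v.){
  # log alpha(z) - log R(z), expanded from the two Gaussian exponents
  (-log(v.[2]) - (z. - v.[1])^2/(2*v.[2])) -
    (log(params[2]) - z.^2/(2*params[3]^2))
}
# The ratio itself, when needed, is exp() of the difference; for large |z.|
# this underflows to 0 rather than producing NaN from 0/0.
rel.alpha <- function(params, z., v.) exp(log.rel.alpha(params, z., v.))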

R histogram breaks Error

I have to prepare an algorithm for my thesis to cross-check a theoretical result, namely that the binomial model for N periods converges to a lognormal distribution as N → ∞. For those of you not familiar with the concept, I have to create an algorithm that takes a starting value S and repeatedly multiplies it by an up-multiplier u and a down-multiplier d for N steps. The algorithm should return a vector whose elements are of the form S·u^i·d^(N-i), i = 0, …, N.
The simple algorithm I proposed is:
rata <- function(N, r, u, d, S){
  length(x) <- N
  for(i in 0:N){
    x[i] <- S*u^{i}*d^{N-i}
  }
  return(x)
}
N is the number of periods and the rest are just unimportant values (u is the up-multiplier, d the down-multiplier, etc.).
In order to extract my results I need to make a histogram of the logarithm of the produced vector to show that the values are normally distributed. However, for N = 100000 (I need a great number of steps to demonstrate convergence), when I type hist(x) I get the error: invalid number of breaks
Can anyone help? Thanks in advance.
An example
taf<-rata(100000,1,1.1,0.9,1)
taf1<-log(taf)
hist(taf1,xlim=c(-400,400))
First I fix your function:
rata <- function(N, r, u, d, S){
  x <- numeric(N+1)
  for(i in 0:N){
    x[i] <- S*u^{i}*d^{N-i}
  }
  return(x)
}
Or relying on vectorization:
rata <- function(N, r, u, d, S){
  x <- S*u^{0:N}*d^{N-(0:N)}
  return(x)
}
taf<-rata(100000,1,1.1,0.9,1)
Looking at the result, we notice that it contains NaN values:
taf[7440 + 7:8]
#[1] 0 NaN
What happened? Apparently the multiplication became NaN:
1.1^7448*0.9^(1e5-7448)
#[1] NaN
1.1^7448
#[1] Inf
0.9^(1e5-7448)
#[1] 0
Inf * 0
#[1] NaN
Why does an Inf value occur? Well, because of double overflow (read help("double")):
1.1^(7440 + 7:8)
#[1] 1.783719e+308 Inf
You have a similar problem with floating point precision when a multiplicand gets close to 0 (read help(".Machine")).
You may need to use arbitrary precision numbers.
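Alternatively, since you only need the histogram of the logarithms anyway, here is a sketch (the name rata.log is mine, and the unused r argument is dropped) that computes log(S) + i*log(u) + (N - i)*log(d) directly, avoiding both the overflow and the underflow without arbitrary precision:
rata.log <- function(N, u, d, S){
  i <- 0:N
  log(S) + i*log(u) + (N - i)*log(d)
}
taf1 <- rata.log(100000, 1.1, 0.9, 1)
hist(taf1)   # all values are finite, so hist() no longer complains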

Numerical precision problems in R?

I have a problem with the following function in R:
test <- function(alpha, beta, n){
  result <- exp(lgamma(alpha) + lgamma(n + beta) - lgamma(alpha + beta + n) -
                (lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)))
  return(result)
}
Now if you insert the following values:
test(-0.03292708, -0.3336882, 10)
It should fail and result in NaN, because if we implement the exact same function in Excel, we get a result that is not a number. The Excel implementation is simple, where J32 is the cell for alpha, K32 for beta and L32 for n. The formula of the resulting cell is given below:
=EXP(GAMMALN(J32)+GAMMALN(L32+K32)-GAMMALN(J32+K32+L32)-(GAMMALN(J32)+GAMMALN(K32)-GAMMALN(J32+K32)))
So Excel seems to give the correct answer, because the function is only defined for alpha and beta greater than zero and n greater than or equal to zero. Therefore I am wondering what is happening here. I have also tried the package Rmpfr to increase the numerical accuracy, but that does not seem to change anything.
Thanks
tl;dr log(gamma(x)) is defined more generally than you think, or than Excel thinks. If you want your function not to accept negative values of alpha and beta, or to return NaN, just test manually and return the appropriate values (if (alpha<0 || beta<0) return(NaN)).
It's not a numerical accuracy problem, it's a definition issue. The Gamma function is defined for negative real values: ?lgamma says:
The gamma function is defined by (Abramowitz and Stegun section 6.1.1, page 255)
Gamma(x) = integral_0^Inf t^(x-1) exp(-t) dt
for all real ‘x’ except zero and negative integers (when ‘NaN’ is returned).
Furthermore, referring to lgamma ...
... and the natural logarithm of the absolute value of the gamma function ...
(emphasis in original)
curve(lgamma(x), -1, 1)
gamma(-0.1)             ## -10.68629
log(gamma(-0.1)+0i)     ## 2.368961+3.141593i
log(abs(gamma(-0.1)))   ## 2.368961
lgamma(-0.1)            ## 2.368961
Wolfram Alpha agrees with the second calculation.
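If you do want R to behave like Excel here, a minimal sketch of the manual guard suggested above (the domain check simply reflects the restrictions stated in the question):
test <- function(alpha, beta, n){
  if (alpha <= 0 || beta <= 0 || n < 0) return(NaN)   # outside the intended domain
  exp(lgamma(alpha) + lgamma(n + beta) - lgamma(alpha + beta + n) -
      (lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)))
}
test(-0.03292708, -0.3336882, 10)   # NaN, matching the Excel behaviour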

cost function in cv.glm of boot library in R

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a glm logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
where the first two arguments denote the data and the model, and K specifies the number of folds.
My problem is the cost parameter which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function which returns the rate of misclassification, something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has cost me hours, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost() given above is in my opinion unnecessarily obfuscated; the following alternative does exactly the same thing in a more expressive way:
cost <- function(r, pi = 0) {
mean((pi < 0.5) & r==1 | (pi > 0.5) & r==0)
}
I will try to explain the cost function in simple words. Let's take the arguments of cv.glm(data, glmfit, cost, K) step by step:
data
The data consist of many observations; think of them as a series of rows, one record per observation.
glmfit
It is the generalized linear model that is fitted to the data. But there is a catch: cross-validation splits the data into K parts, holds out each part in turn (the test set), refits glmfit on the remaining parts (the training set), and predicts the held-out part. The predictions form a series with the same number of elements as the held-out input.
cost
The cost function. It takes two arguments: first the observed responses of the test set, and second the predictions of glmfit for that test set. The default is the average squared error function.
It averages the squared differences between observed and predicted data points: conceptually, a loop runs over the test set (observed and predicted vectors have the same number of elements), computes each difference, squares it, and accumulates the result.
K
The number of parts into which the data should be split. The default gives leave-one-out cross-validation. (A minimal call with the default cost is sketched below.)
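As a sketch under my own assumptions (the nodal data shipped with boot stand in for your data, and K = 10 rather than the leave-one-out default):
library(boot)
fit <- glm(r ~ stage + xray + acid, family = binomial, data = nodal)
cv.glm(nodal, fit, K = 10)$delta   # cross-validated average squared error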
Judging from your cost function description, your input (x) would be a set of numbers between 0 and 1 (0-0.5 = no, 0.5-1 = yes) and your output (y) would be 'yes' or 'no'. So the error (e) between prediction (x) and observation (y) would be:
cost <- function(x, y){
  se <- 0
  for (i in 1:length(x)){
    if (x[i] > 0.5) {
      e <- if (y[i] == 'yes') 0 else x[i] - 0.5
    } else {
      e <- if (y[i] == 'no')  0 else 0.5 - x[i]
    }
    se <- se + e*e   # accumulate the squared error
  }
  return(se/length(x))   # mean squared error
}
Sources : http://www.cs.cmu.edu/~schneide/tut5/node42.html
The cost function can optionally be defined if there is one you prefer over the default average squared error. If you want to do so, you write a function that returns the cost you want to minimize using two inputs: (1) the vector of known labels that you are predicting, and (2) the vector of predicted probabilities from your model for those corresponding labels. For the cost function that (I think) you described in your post, you are looking for a function that returns the average rate of accurate classifications, which would look something like this:
cost <- function(labels,pred){
mean(labels==ifelse(pred > 0.5, 1, 0))
}
With that function defined you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have reason to. Your example isn't reproducible, so here is another example:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type="response" argument in the predict function. The documentation from library(boot) should state this explicitly. I would otherwise be forced to assume that the default type="link" is used inside the cv.glm function, in which case the cost function would not work as intended.
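As an aside, a minimal sketch of the distinction raised here, reusing the nodal.glm model fitted above: when computing predictions by hand to test a cost function, ask predict() for probabilities explicitly.
p.link <- predict(nodal.glm, newdata = nodal)                     # default: linear predictor (log-odds)
p.prob <- predict(nodal.glm, newdata = nodal, type = "response")  # probabilities in [0, 1]
range(p.link)   # can fall outside [0, 1]
range(p.prob)   # always within [0, 1]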
