I'm trying to use R for global sensitivity analysis of a function. I'm completely new to R, so I'm having a hard time understanding the documentation correctly.
I want to use the fast99 method from the sensitivity package but it returns NaN for 2 of my 4 factors.
I'm using R Studio and the sensitivity package.
My function is
Mtb <- function(Input) {
alpha<-Input[,1]
beta<-Input[,2]
gamma<-Input[,3]
nu<-Input[,4]
root<-4*beta+alpha^2*gamma +2*alpha*beta*gamma*nu+beta^2*gamma*nu^2
denominator<-2*beta*gamma
summand<-alpha*gamma-beta*gamma*nu
result<-(summand+sqrt(gamma)*sqrt(root))/denominator
return(result)
}
And then I call
library(sensitivity)
factors<-c("alpha","beta", "gamma", "nu")
x<-fast99(Mtb, factors=factors, n=1000, q.arg=list(min=0, max=1))
print(x)
I expect the result to be some number for each factor but it returns
Call:
fast99(model = Mtb, factors = factors, n = 1000, q.arg = list(min = 0, max = 1))
Model runs: 4000
Estimations of the indices:
first order total order
alpha NaN NaN
beta 0.23928895 0.8855446
gamma 0.03075694 0.5991250
nu NaN NaN
That can't be right, since alpha should be important.
I found the problem: if I set the minimum value to 0.001, it works fine. There seems to be a division by zero, which puzzles me because the denominator contains only beta and gamma.
The problem was having 0 as the minimum value.
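A minimal sketch of why a lower bound of 0 is a problem even though only beta and gamma appear in the denominator: if the sampling design ever places beta or gamma exactly on that bound, the model output is no longer finite, and non-finite outputs would presumably spoil the indices for other factors as well. The test points below are hypothetical and only base R is used.

```r
# Same model as in the question, written for a 4-column input matrix.
Mtb <- function(Input) {
  alpha <- Input[, 1]
  beta  <- Input[, 2]
  gamma <- Input[, 3]
  nu    <- Input[, 4]
  root <- 4 * beta + alpha^2 * gamma + 2 * alpha * beta * gamma * nu +
    beta^2 * gamma * nu^2
  denominator <- 2 * beta * gamma
  summand <- alpha * gamma - beta * gamma * nu
  (summand + sqrt(gamma) * sqrt(root)) / denominator
}

# Hypothetical test points (columns: alpha, beta, gamma, nu).
pts <- rbind(
  c(0.5, 0.5, 0.5, 0.5),  # interior point
  c(0.5, 0.0, 0.5, 0.5),  # beta on the lower bound
  c(0.5, 0.5, 0.0, 0.5)   # gamma on the lower bound
)
y <- Mtb(pts)
print(y)
print(is.finite(y))
```

Moving the lower bound slightly above zero, as with min = 0.001, keeps every sampled point inside the region where the function is defined.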
Just like the question title.
I ran Ljung-Box tests in R to check the model fit for a time series with constant values (i.e., 0), and, unsurprisingly, got a perfect fit with zero residuals. But I want to know why the test returns NA for Q and the p-value instead of, for example, p = 0.99999.
I want to have a theoretical interpretation for this.
Given you are using stats::Box.test() you can take a look at the code yourself:
utils::getAnywhere(Box.test)
The Ljung-Box Q-statistic is NaN because
cor <- acf(x, lag.max = lag, plot = FALSE, na.action = na.pass)
already returns NaN. So the subsequent computations
obs <- cor$acf[2:(lag+1)]
STATISTIC <- n*(n+2)*sum(1/seq.int(n-1, n-lag)*obs^2)
are NaN too, and so on. So you should look at what is going on inside stats::acf():
utils::getAnywhere(acf)
You should also be able to find the code on Github.
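As a small illustration of the chain described above: a constant series has zero sample variance, so every lag-k autocorrelation is a 0/0 ratio, and that undefined value propagates into Q and the p-value. The series length and the number of lags below are arbitrary choices.

```r
x <- rep(0, 20)  # constant series, as in the question

# Autocorrelations at lags 1..5: each one is a 0/0 ratio
r <- acf(x, lag.max = 5, plot = FALSE)
print(r$acf[2:6])

# The Ljung-Box statistic is built from those values, so it is undefined too
bt <- Box.test(x, lag = 5, type = "Ljung-Box")
print(bt$statistic)
print(bt$p.value)
```

So the missing result is not a numerical accident: with zero variance the autocorrelation itself is undefined, and there is no statistic to compare against the chi-squared distribution.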
I am trying to plot the log-likelihood function of the Cauchy distribution for varying values of theta (location parameter). These are my observations:
obs<-c(1.77,-0.23,2.76,3.80,3.47,56.75,-1.34,4.24,3.29,3.71,-2.40,4.53,-0.07,-1.05,-13.87,-2.53,-1.74,0.27,43.21)
Here is my log-likelihood function:
ll_c<-function(theta,x_values){
n<-length(x_values)
logl<- -n*log(pi)-sum(log(1+(x_values-theta)^2))
return(logl)
}
and I've tried making a plot using this code:
x<-seq(from=-10,to=10,by=0.1);length(x)
theta_null<-NULL
for (i in x){
theta_log<-ll_c(i,counts)
theta_null<-c(theta_null,theta_log)
}
plot(theta_null)
The graph does not look right and for some reason the length of x and theta_null differs.
I am assuming that theta is your location parameter (the scale is set to 1 in my example). You should obtain the same result using a t-distribution with 1 df and shifting the observations by theta. I left some comments in the code as guidance.
obs = c(1.77,-0.23,2.76,3.80,3.47,56.75,-1.34,4.24,3.29,3.71,-2.40,4.53,-0.07,-1.05,-13.87,-2.53,-1.74,0.27,43.21)
ll_c=function(theta, obs)
{
# Compute the log-likelihood of obs for a given value of theta (location)
logl = sum(dcauchy(obs, location = theta, scale = 1, log = TRUE))
return(logl)
}
# Loop for possible values of theta(obs given)
x = seq(from=-10,to=10,by=0.1)
ll = NULL
for (i in x)
{
ll = c(ll, ll_c(i, obs))
}
# Plot log-lik vs possible value of theta
plot(x, ll)
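As a quick check of the equivalence mentioned above, the sketch below evaluates the log-likelihood three ways on the same grid: with the explicit formula from the question, with dcauchy(), and with a t density with 1 df applied to the shifted observations. The names theta_grid, ll_formula, ll_cauchy and ll_t are introduced here for illustration only.

```r
obs <- c(1.77, -0.23, 2.76, 3.80, 3.47, 56.75, -1.34, 4.24, 3.29, 3.71,
         -2.40, 4.53, -0.07, -1.05, -13.87, -2.53, -1.74, 0.27, 43.21)
theta_grid <- seq(from = -10, to = 10, by = 0.1)

# Explicit Cauchy log-likelihood with scale 1
ll_formula <- sapply(theta_grid, function(th)
  -length(obs) * log(pi) - sum(log(1 + (obs - th)^2)))

# Built-in Cauchy density with location theta
ll_cauchy <- sapply(theta_grid, function(th)
  sum(dcauchy(obs, location = th, scale = 1, log = TRUE)))

# t density with 1 df, evaluated at the shifted observations
ll_t <- sapply(theta_grid, function(th)
  sum(dt(obs - th, df = 1, log = TRUE)))

print(all.equal(ll_formula, ll_cauchy))
print(all.equal(ll_cauchy, ll_t))
print(length(theta_grid) == length(ll_cauchy))
```

Because sapply() returns one value per grid point, the two vectors always have the same length, and plot(theta_grid, ll_cauchy, type = "l") draws the curve. Note that the shift is applied to the observations; the ncp argument of dt() is a noncentrality parameter, not a location shift.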
It is hard to say exactly what you are experiencing without more info. But I'll make an educated guess.
First of all, we can simplify this a lot by using the *t family of functions for the t distribution, as the Cauchy distribution is just the t distribution with df = 1. So your calculations could have been done using
for(i in ncp)
theta_null <- c(theta_null, sum(dt(values, 1, i, log = TRUE)))
Note that multiplying by n doesn't actually matter for any practical purposes. We are usually interested in minimizing/maximizing the likelihood in which case all constants are irrelevant.
Now if we use this approach, we can quite quickly notice something by printing the values:
print(head(theta_null))
[1] -Inf -Inf -Inf -Inf -Inf -Inf
So I am assuming what you are experiencing is that many of your values are "almost" negative infinity, and maybe these are not stored correctly in your outcome vector. I can't see that this should be the case from your code, but this would be my initial guess.
I want to generate 95% confidence intervals for the R2 of a linear model. While developing the code and using the same seed for both approaches, I found that doing the bootstrap manually doesn't give me the same results as using the boot function from the boot package. Am I doing something wrong, or why is this happening?
On the other hand, in order to calculate the 95% CI I was trying to use the confint function, but I'm getting the error "$ operator is invalid for atomic vectors". Is there any way to avoid this error?
Here is a reproducible example to explain my concerns.
#creating the dataframe
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
#bootstrapping manually
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(100, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
names(B_manually)[1]<- "r_squared"
#Bootstrapping using the function "Boot" from Boot library
set.seed(123)
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=100)
head(B_manually) == head(B_boot$t)
r_squared
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
#Why do the results of the manual and boot approaches differ if I'm using the same seed?
# 2nd question (Using the confint function to determine the 95 CI gives me an error)
confint(B_manually$r_squared, level = 0.95, method = "quantile")
confint(B_boot$t, level = 0.95, method = "quantile")
#Error: $ operator is invalid for atomic vectors
#NOTE: I already used boot.ci as well as the quantile function to determine the
#95% confidence interval, but the results of these CIs differ from each other,
#so I just wanted to compare them with the confint function.
quantile(B_boot$t, c(0.025,0.975))
boot.ci(B_boot, index=1,type="perc")
Thanks in advance for any help!
The boot package does not use replicate with sample to generate the indices. Check the importance.array function in the source code for boot. It basically generates all the indices in one go, so there's no reason to assume that you will end up with the same indices or the same result. Taking a step back: the purpose of the bootstrap is to use random resampling to obtain an estimate of your parameters, so you should get similar estimates from different implementations of the bootstrap.
For example, you can see the distribution of R^2 is very similar:
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(999, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=999)
par(mfrow=c(2,1))
hist(B_manually[,1],breaks=seq(0,0.4,0.01),main="dist of R2 manual")
hist(B_boot$t,breaks=seq(0,0.4,0.01),main="dist of R2 boot")
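The effect of drawing all indices in one go can be sketched with base R alone. The layout below (one call to sample.int(), reshaped into a matrix with one row per replicate) is an assumption made for illustration and is not copied from the boot source; the point is only that the same seed gives different resamples once the indices are drawn and arranged differently.

```r
n <- 100  # number of observations
R <- 100  # number of bootstrap replicates

# Manual approach: one call to sample() per replicate (n x R, one column each)
set.seed(123)
idx_replicate <- replicate(R, sample(n, replace = TRUE))

# Assumed "all at once" approach: a single draw of n * R indices,
# arranged as an R x n matrix (one row per replicate)
set.seed(123)
idx_at_once <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R)

# Compare the indices used for the first replicate under each scheme
print(identical(idx_replicate[, 1], idx_at_once[1, ]))
```

Since the individual resamples differ, the individual R2 values differ as well, while their distributions stay comparable.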
The confint function you are using is meant for an lm object and estimates a confidence interval for the coefficients (see its help page). It takes the standard error of each coefficient and multiplies it by the critical t-value to give you the confidence interval. You can check out this book page for the formula. The objects from your bootstrapping are not lm objects, so this function doesn't work on them. It is not meant for any other estimates.
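Since confint does not apply here, a percentile interval can be read directly off the bootstrap replicates with quantile(). This is a minimal sketch using the same simulated data as above; the names r2 and ci are introduced for illustration.

```r
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF <- data.frame(a, b)

# Bootstrap the R^2 of lm(a ~ b) by resampling rows with replacement
set.seed(123)
r2 <- replicate(999, {
  idx <- sample(nrow(DF), replace = TRUE)
  summary(lm(a ~ b, data = DF[idx, ]))$r.squared
})

# 95% percentile interval from the bootstrap distribution
ci <- quantile(r2, probs = c(0.025, 0.975))
print(ci)
```

The percentile interval from boot.ci(..., type = "perc") may differ slightly from this one, because the two functions need not use the same interpolation rule between order statistics.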
I am using the code below to evaluate the negative binomial distribution in R
dnbinom(n11, size=p[1], prob=p[2]/(p[2]+E))
where dnbinom is the density function of the negative binomial distribution and
n11 and E are integer vectors.
Now I want to run the same code in Julia. Which function should I use in place of dnbinom?
The function must have the arguments (x, size, prob)
where x = vector of quantiles.
size = target for number of successful trials, or dispersion parameter (the shape parameter of the gamma mixing distribution). Must be strictly positive, need not be integer.
prob = probability of success in each trial. 0 < prob <= 1.
Below is my full code (updated as per the answers given, but still not working)
using Distributions
data = query("Select count_a,EXP_COUNT from SM_STAT_ALGO_LOCALTRADE_SOC;")
f([0.2,0.06,1.4,1.8,0.1],data[:,1],data[:,2])
function f(x::Vector,n11,E)
return sum(-log(x[5] * pdf(NegativeBinomial(x[1], x[2]/(x[2]+E), n11)) + (1-x[5]) * pdf(NegativeBinomial(x[3], x[4]/(x[4]+E),n11))))
end
Assuming that you want the probabilities of a vector of outcomes, you can do
using Distributions
function dnbinom(x, size, prob)
dist = NegativeBinomial(size,prob)
map(y->pdf(dist,y), x)
end
@show dnbinom([3,5], 10, 0.1)
To get the equivalent of dnbinom in R
dnbinom(1, 1, 0.5)
# [1] 0.25
you can use
using Distributions
pdf(NegativeBinomial(), 1)
# 0.25000000000000006
in Julia.
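When porting, it can help to have reference values computed from the closed-form probability mass function, so that a translated version can be compared against them. The sketch below, on the R side, writes out the (x, size, prob) parameterization explicitly; dnbinom_manual is a hypothetical helper name, not part of any package.

```r
# Closed-form negative binomial pmf in the (x, size, prob) parameterization:
#   Gamma(x + size) / (Gamma(size) * x!) * prob^size * (1 - prob)^x
# Computed on the log scale so it also works for non-integer size.
dnbinom_manual <- function(x, size, prob) {
  exp(lgamma(x + size) - lgamma(size) - lgamma(x + 1) +
        size * log(prob) + x * log1p(-prob))
}

# Compare against the built-in density
print(all.equal(dnbinom_manual(c(3, 5), 10, 0.1),
                dnbinom(c(3, 5), size = 10, prob = 0.1)))
print(all.equal(dnbinom_manual(2, 1.5, 0.3),
                dnbinom(2, size = 1.5, prob = 0.3)))
print(dnbinom_manual(1, 1, 0.5))
```

Here x counts failures before the size-th success and prob is the success probability, which is the convention the values above rely on.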
I have to create a model which is a mixture of a normal and a log-normal distribution. To create it, I need to estimate the 2 covariance matrices and the mixing parameter (7 parameters in total) by maximizing the log-likelihood function. This maximization has to be performed with the nlm routine.
As I use relative data, the means are known and equal to 1.
I've already tried it in 1 dimension (with 1 set of relative data) and it works well. However, when I introduce the 2nd set of relative data, I get illogical results for the correlation and a lot of warning messages (25 in total).
To estimate these parameters, I first defined the log-likelihood function with the 2 commands dmvnorm and dlnorm.rplus. Then I assigned starting values to the parameters, and finally I used the nlm routine to estimate them (see script below).
`P <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_P-3000.asc", return.header = FALSE);
V <- read.ascii.grid("d:/Documents/JOINT_FREQUENCY/grid_E727_V-3000.asc", return.header = FALSE);
p <- c(P); # tranform matrix into a vector
v <- c(V);
p<- p[!is.na(p)] # removing NA values
v<- v[!is.na(v)]
p_rel <- p/mean(p) #Transforming the data to relative values
v_rel <- v/mean(v)
PV <- cbind(p_rel, v_rel) # create a matrix of vectors
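L <- function(par, p_rel, v_rel) {
  # Normal component: standard deviations par[1], par[2]; correlation par[3]
  sigN <- matrix(c(par[1]^2, par[1]*par[2]*par[3],
                   par[1]*par[2]*par[3], par[2]^2), nrow = 2, ncol = 2)
  # Log-normal component: standard deviations par[4], par[5]; correlation par[6]
  sigLN <- matrix(c(par[4]^2, par[4]*par[5]*par[6],
                    par[4]*par[5]*par[6], par[5]^2), nrow = 2, ncol = 2)
  # Negative log-likelihood of the mixture; par[7] is the mixing parameter
  return(-sum(log((1 - par[7]) * dmvnorm(PV, mean = c(1, 1), sigma = sigN) +
                  par[7] * dlnorm.rplus(PV, meanlog = c(1, 1), varlog = sigLN))))
}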
par.start<- c(0.74, 0.66 ,0.40, 1.4, 1.2, 0.4, 0.5) # log-likelihood estimators
result<-nlm(L,par.start,v_rel=v_rel,p_rel=p_rel, hessian=TRUE, iterlim=200, check.analyticals= TRUE)
Warning messages:
1: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
2: In sqrt(2 * pi * det(varlog)) : NaNs produced
3: In nlm(L, par.start, p_rel = p_rel, v_rel = v_rel, hessian = TRUE) :
NA/Inf replaced by maximum positive value
4: In log(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) :
NaNs produced
... and so on, up to 25.
par.hat <- result$estimate
cat("sigN_p =", par.hat[1],"\n","sigN_v =", par.hat[2],"\n","rhoN =", par.hat[3],"\n","sigLN_p =", par.hat[4],"\n","sigLN_v =", par.hat[5],"\n","rhoLN =", par.hat[6],"\n","mixing parameter =", par.hat[7],"\n")
sigN_p = 0.5403361
sigN_v = 0.6667375
rhoN = 0.6260181
sigLN_p = 1.705626
sigLN_v = 1.592832
rhoLN = 0.9735974
mixing parameter = 0.8113369`
Does anyone know what is wrong with my model, or how I should go about finding these parameters in 2 dimensions?
Thank you very much for taking time to look at my questions.
Regards,
Gladys Hertzog
When I do this kind of optimization problem, I find that it's important to make sure that all the variables that I'm optimizing over are constrained to plausible values. For example, standard deviation variables have to be positive, and from knowledge of the situation that I'm modelling I'll probably be able to put an upper bound on all my standard deviation variables as well. So if s is one of my standard deviation variables, and if m is the maximum value that I want it to take, instead of working with s I'll solve for the variable z, which is related to s via
s = m / (1 + e^{-z})
In that formula, z is unconstrained, but s must lie between 0 and m. This is vital because optimization routines whose variables are not constrained to plausible values will often try completely implausible values while they're trying to bound the solution. Implausible values often cause problems, e.g. with precision, which then result in NaNs etc. The general formula that I use for constraining a single variable x to lie between a and b is
x = a + (b - a) / (1 + e^{-z})
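The transformation can be sketched in R as a pair of helper functions. The names to_bounded and to_unbounded are hypothetical; the optimizer would work on z, and the likelihood would call to_bounded() to recover the constrained value.

```r
# Map an unconstrained z to a value strictly between a and b
to_bounded <- function(z, a, b) a + (b - a) / (1 + exp(-z))

# Inverse map, useful for turning a starting value x into a starting z
to_unbounded <- function(x, a, b) log((x - a) / (b - x))

z <- c(-5, -1, 0, 1, 5)
s <- to_bounded(z, a = 0, b = 2)   # e.g. a standard deviation capped at m = 2
print(s)
print(all.equal(to_unbounded(s, a = 0, b = 2), z))
```

With a = 0 and b = m this reduces to the first formula, so a starting value such as 0.74 would be passed to the optimizer as to_unbounded(0.74, 0, m).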
However, for your particular problem, where you're looking for covariance matrices, a more sophisticated approach is necessary than simply bounding the individual variables. Covariance matrices must be positive semi-definite, so if you simply optimize over the individual entries of the matrix, the optimization will probably fail (producing NaNs) whenever a matrix that isn't positive definite is fed into the likelihood function. My guess is that this is what's causing your optimization to fail. To get around the problem, one approach is to solve for the Cholesky decomposition of the covariance matrix instead of the covariance matrix itself.
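A minimal sketch of the Cholesky idea for a 2x2 covariance matrix, assuming three unconstrained parameters per matrix: the diagonal of the lower-triangular factor is exponentiated to keep it positive, and the covariance is rebuilt as the factor times its transpose. The function name make_cov and the example values are illustrative, not part of the original script.

```r
# Build a 2x2 covariance matrix from 3 unconstrained numbers.
# theta[1], theta[3]: logs of the diagonal of the Cholesky factor
# theta[2]: off-diagonal element of the Cholesky factor
make_cov <- function(theta) {
  L <- matrix(0, nrow = 2, ncol = 2)
  L[1, 1] <- exp(theta[1])
  L[2, 1] <- theta[2]
  L[2, 2] <- exp(theta[3])
  L %*% t(L)  # positive definite by construction
}

Sigma <- make_cov(c(-0.3, 0.4, 0.2))
print(Sigma)
print(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)
```

Whatever values the optimizer tries for theta, the resulting matrix is a valid covariance matrix, so the eigenvalue and determinant computations inside the density functions never see an inadmissible input.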