I'm fairly new to R and I'm trying to simulate some data, fit a model to it, and do a runs test on the residuals. However, I get a strange error when performing the runs test.
Here is my code:
library(TSA)
d = arima.sim(list(ma=c(0.5)), n=5000)
model = arima(d, order=c(0, 0, 1), include.mean=FALSE)
runs(model$residuals)
The error is:
Error in if (pvalue > 0.5) pvalue <- 1 - pvalue :
missing value where TRUE/FALSE needed
What does this error mean?
I'm not familiar with this package, but I can tell that either your data is a special case or this package (the runs function, at least) needs a review.
pdf <- pdf/sum(pdf)                    # pdf contains Inf, so this becomes all NaN
mu <- 1 + 2 * n1 * n2/(n1 + n2)
if (r1 <= mu)                          # this branch is taken with your data
    pvalue <- sum(pdf[(1:l2) <= r1])   # pdf is all NaN, so pvalue ends up NaN
if (r1 > mu)                           # oddly, this is not an else
    pvalue <- sum(pdf[(1:l2) >= r1])
if (pvalue > 0.5)                      # this is what throws the error: pvalue is NaN
    pvalue <- 1 - pvalue
A few lines above, pdf gets the Inf values:
for (i in seq(4, l2, 2)) {             # when i is 166, pdf gets its first Inf value
    r <- (i - 2)/2
    f[r + 1] <- (n1 - r) * (n2 - r)/r/r * f[r]
    pdf[i] <- f[r + 1]
}
I can't go any further because I don't know what such data is supposed to look like, nor what results you should get from this function. Try taking a look at the function yourself; I've had enough :-) The missing else in the code above is not the only odd thing in it. Hope that helps.
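As a practical workaround in the meantime, here is a minimal sketch of my own (not part of TSA): a classical Wald-Wolfowitz runs test on the signs of the residuals using the large-sample normal approximation, which never touches the exact run-count pdf that overflows above.
runs_normal <- function(x, cut = 0) {
  s  <- sign(x - cut)                   # residuals are compared against 0 by default
  s  <- s[s != 0]                       # drop values exactly at the cut point
  n1 <- sum(s > 0)
  n2 <- sum(s < 0)
  r  <- 1 + sum(diff(s) != 0)           # observed number of runs
  mu <- 1 + 2 * n1 * n2 / (n1 + n2)     # expected runs (same mu as in the package code)
  v  <- 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
        ((n1 + n2)^2 * (n1 + n2 - 1))   # variance of the run count
  z  <- (r - mu) / sqrt(v)
  list(runs = r, expected = mu, p.value = 2 * pnorm(-abs(z)))
}
runs_normal(model$residuals)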
When I run pairwise.wilcox.test() on data with many ties, I get the following warning:
Warning in wilcox.test.default(xi, xj, paired = paired, ...) :
cannot compute exact p-value with ties
I would like to know how wilcox.test() handles the ties.
What method is used (by default) to rank the observations?
What does "P value adjustment method: holm" mean?
When there are ties, wilcox.test uses a Normal approximation. You can see this in the source code of stats:::wilcox.test.default; here is a slightly simplified version.
## example values
x <- 1:5
y <- 2:6
## assumes mu=0
r <- c(x, y)
## slightly simplified (assumes `digits.rank` is equal to its default `Inf` value)
r <- rank(r)
NTIES <- table(r)
n.x <- length(x)
n.y <- length(y)
STATISTIC <- c("W" = sum(r[seq_along(x)]) - n.x * (n.x + 1) / 2)
z <- STATISTIC - n.x * n.y / 2
SIGMA <- sqrt((n.x * n.y / 12) *
              ((n.x + n.y + 1)
               - sum(NTIES^3 - NTIES)   ## this will be zero in the absence of ties
               / ((n.x + n.y) * (n.x + n.y - 1))))
## stuff about continuity correction omitted here
z <- z/SIGMA ## z-score, used to compute p-value
2*pnorm(z) ## 2-tailed p-value (skipped testing whether in lower or upper tail)
This gives the same p-value as wilcox.test(x, y, correct = FALSE).
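A quick numeric check of that claim (my own addition), run right after the snippet above:
2 * pnorm(z)                                 # hand-computed two-tailed p-value
wilcox.test(x, y, correct = FALSE)$p.value   # same value (the usual ties warning is expected)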
As for the p-value adjustment ("holm"), this points you to the help page ?p.adjust, which says it uses the method from Holm (1979). You can find out more about the method in the reference below, for example.
Holm, S. (1979). A simple sequentially rejective multiple test
procedure. Scandinavian Journal of Statistics, 6, 65-70.
https://www.jstor.org/stable/4615733.
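To see concretely what "P value adjustment method: holm" does, here is a small illustration with made-up p-values (my own example, not from the original output): sort the p-values, multiply the i-th smallest by (m - i + 1), and enforce monotonicity.
p <- c(0.01, 0.02, 0.04)                       # made-up p-values
p.adjust(p, method = "holm")
## the same thing by hand:
m <- length(p)
pmin(1, cummax(sort(p) * (m:1)))[order(order(p))]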
So, just a touch of backstory: I've been learning biostatistics at university for the past 4-5 months, with 6 months of biomathematics before that. I only started diving deep into programming around 5 days ago.
I've been trying to redo t.test() with my own function.
test2 = function(t, u){
  T = (mean(t) - u) / (sd(t) / sqrt(length(t)))
  t1 = round(T, digits=5)
  df = length(t)
  cat(paste('t - value =', t1,
            '\n', 'df =', df-1,
            # Hungarian: "Alternative hypothesis: the sample mean is not equal to the hypothesized mean"
            '\n', 'Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}
I tried searching for the formula for the p-value and found one, but when I used it, my value was different from the one returned by t.test().
The t-value and the df do match t.test().
I highly appreciate any help, thank you.
P.S.: Don't worry about the last line, it's in Hungarian.
The p-value can be derived from pt, the cumulative distribution function of the t distribution. Using this, and making the notation more conventional with sample x and population mean mu, we can use something like:
test2 <- function(x, u){
  t  <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
  df <- length(x) - 1
  p  <- 2 * pt(-abs(t), df=df)   # two-sided; abs() also covers negative t-values
  cat('t-value =', t, ', df =', df, ', p =', p, '\n')
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
For our own test2 we get:
t-value = 1.905175 , df = 9, p = 0.08914715
and for R's t.test
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
The definitive source for what R is doing is the source code. If you look at the source code for stats:::t.test.default (which you can see by typing stats:::t.test.default into the console, without parentheses, and hitting Enter), you'll see that for a one-sample test like the one you're trying to do above, the relevant computation is:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
    pval <- pt(tstat, df)
}
else if (alternative == "greater") {
    pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
    pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
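Putting those pieces together for the two-sided case, here is a small check of my own that reuses the example data from the first answer:
set.seed(123)
x  <- rnorm(10, mean = 5.5)
mu <- 5
tstat <- (mean(x) - mu) / sqrt(var(x) / length(x))
2 * pt(-abs(tstat), df = length(x) - 1)   # 0.08914715, as above
t.test(x, mu = mu)$p.value                # same value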
My question relates to the use of R for deriving maximum likelihood estimates of parameters when a probability distribution is expressed as an infinite sum, such as the one below due to Rao, Girija et al.
I wanted to see if I could reproduce the maximum likelihood estimates obtained by these authors (who used Matlab, rather than R) when the model is applied to a given set of data. My attempt is given below, although this throws up several warnings that "longer object length is not a multiple of shorter object length". I know why I am getting this warning, but I do not know how to remedy it. How can I edit my code to overcome this?
Also, is there a better way to handle infinite sums? Here I'm just using an arbitrary large number for n (1000).
library(bbmle)
svec <- list(c=1,lambda=1)
x <- scan(textConnection("0.1396263 0.1570796 0.2268928 0.2268928 0.2443461 0.3141593 0.3839724 0.4712389 0.5235988 0.5934119 0.6632251 0.6632251 0.6981317 0.7679449 0.7853982 0.8203047 0.8377580 0.8377580 0.8377580 0.8377580 0.8726646 0.9250245 0.9773844 0.9948377 1.0122910 1.0122910 1.0646508 1.0995574 1.1170107 1.1170107 1.1170107 1.1344640 1.1344640 1.1868239 1.2217305 1.2740904 1.3613568 1.3613568 1.3613568 1.4486233 1.4486233 1.5358897 1.5358897 1.5358897 1.5707963 1.6057029 1.6057029 1.6231562 1.6580628 1.6755161 1.7104227 1.7453293 1.7976891 1.8500490 1.9722221 2.0594885 2.4085544 2.6703538 2.6703538 2.7052603 3.5604717 3.7524579 3.8920842 3.9444441 4.1364303 4.1538836 4.2411501 4.2586034 4.3633231 4.3807764 4.4854962 4.6774824 4.9741884 5.5676003 5.9864793 6.1086524"))
dL <- function(x, c, lambda, n = 1000, log=TRUE) {
  k <- 0:n
  r <- log(sum(lambda*c*(x+2*k*pi)^(-c-1)*(exp(-(x+2*k*pi)^(-c))^(lambda))))
  if (log) return(r) else return(exp(r))
}
dat <- data.frame(x)
m1 <- mle2( x ~ dL(c,lambda),
            data=dat,
            start=svec,
            control=list(parscale=unlist(svec)),
            method="L-BFGS-B",
            lower=c(0,0)
          )
I suggest starting out with that algorithm and making a density function that can be tested for proper behavior by integrating over its range of definition, c(0, 2*pi). You are calling it a "probability function", but that is a term I associate with CDFs rather than density functions (PDFs):
dL <- function(x, c=1, lambda=1, n = 1000, log=FALSE) {
  k <- 0:n
  r <- sum(lambda*c*(x+2*k*pi)^(-c-1)*(exp(-(x+2*k*pi)^(-c))^(lambda)))
  if (log) log(r) else r    # return the density itself when log = FALSE
}
vdL <- Vectorize(dL)
integrate(vdL, 0,2*pi)
#0.999841 with absolute error < 9.3e-06
LL <- function(x, c, lambda){ -sum( log( vdL(x, c, lambda))) }
(I think you were trying to pack too much into your log-likelihood function, so I decided to break the steps apart.)
When I ran that version I got a warning from the final mle2 step that I didn't like, and I thought this density function might occasionally be returning negative values, so this was my final version:
dL <- function(x, c=1, lambda=1, n = 1000) {
  k <- 0:n
  # floor the sum at a tiny positive value so log() in LL never sees zero or negatives
  r <- max( sum(lambda*c*(x+2*k*pi)^(-c-1)*(exp(-(x+2*k*pi)^(-c))^(lambda))), 0.00000001)
  r
}
vdL <- Vectorize(dL)
integrate(vdL, 0,2*pi)
#0.999841 with absolute error < 9.3e-06
LL <- function(x, c, lambda){ -sum( log( vdL(x, c, lambda))) }
(m0 <- mle2(LL,start=list(c=0.2,lambda=1),data=list(x=x)))
#------------------------
Call:
mle2(minuslogl = LL, start = list(c = 0.2, lambda = 1), data = list(x = x))
Coefficients:
c lambda
0.9009665 1.1372237
Log-likelihood: -116.96
(The log-likelihoods from the run with the warning and from the warning-free run were the same.)
So I think you were attempting to pack too much into your definition of the log-likelihood function and got tripped up somewhere. There should have been two summations: one for the density approximation and a second one for the log-likelihood. The vectors in those summations have different lengths, hence the warning you were seeing. Unpacking the steps allowed success, at least to the extent of not throwing warnings. I'm not sure what that density represents and cannot verify correctness.
As far as the question of whether there is a better way to approximate an infinite series, the answer hinges on what is known about the rate of convergence of the partial sums, and whether you can set up a tolerance value to compare successive values and stop calculations after a smaller number of terms.
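For what it's worth, here is a minimal sketch of that idea (my own, not from the paper): keep adding terms until the next one is negligible relative to the running sum, instead of fixing n = 1000.
dL_tol <- function(x, c = 1, lambda = 1, tol = 1e-12, kmax = 10000) {
  total <- 0
  for (k in 0:kmax) {
    term  <- lambda*c*(x+2*k*pi)^(-c-1)*(exp(-(x+2*k*pi)^(-c))^(lambda))
    total <- total + term
    if (term < tol * total) break      # terms shrink quickly here, so this exits early
  }
  total
}
vdL_tol <- Vectorize(dL_tol)
integrate(vdL_tol, 0, 2*pi)            # should again be close to 1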
When I look at the density, it makes me wonder if it applies to some scattering process:
curve(vdL(x, c=.9, lambda=1.137), 0.00001, 2*pi)
You can examine the speed of convergence by looking at the ratios of successive terms. Here's a function that does that for the first 10 terms at an arbitrary x:
> ratios <- function(x, c=1, lambda=1) {lambda*c*(x+2*(1:11)*pi)^(-c-1)*(exp(-(x+2*(1:10)*pi)^(-c))^(lambda))/lambda*c*(x+2*(0:10)*pi)^(-c-1)*(exp(-(x+2*(0:10)*pi)^(-c))^(lambda)) }
> ratios(0.5)
[1] 1.015263e-02 1.017560e-04 1.376150e-05 3.712618e-06 1.392658e-06 6.351874e-07 3.299032e-07 1.880054e-07
[9] 1.148694e-07 7.409595e-08 4.369854e-08
Warning message:
In lambda * c * (x + 2 * (1:11) * pi)^(-c - 1) * (exp(-(x + 2 * :
longer object length is not a multiple of shorter object length
> ratios(0.05)
[1] 1.755301e-08 1.235632e-04 1.541082e-05 4.024074e-06 1.482741e-06 6.686497e-07 3.445688e-07 1.952358e-07
[9] 1.187626e-07 7.634088e-08 4.443193e-08
Warning message:
In lambda * c * (x + 2 * (1:11) * pi)^(-c - 1) * (exp(-(x + 2 * :
longer object length is not a multiple of shorter object length
That looks like pretty rapid convergence to me, so I'm guessing that you could use only the first 20 terms and get similar results. With 20 terms the results look like:
> integrate(vdL, 0,2*pi)
0.9924498 with absolute error < 9.3e-06
> (m0 <- mle2(LL,start=list(c=0.2,lambda=1),data=list(x=x)))
Call:
mle2(minuslogl = LL, start = list(c = 0.2, lambda = 1), data = list(x = x))
Coefficients:
c lambda
0.9542066 1.1098169
Log-likelihood: -117.83
Since you never interpret an LL in isolation but rather look at differences, I'm guessing that the minor difference will not affect your inferences adversely.
I'm trying to use DEoptim to find the global minimum of z in -1 < x < 1, -1 < y < 1, but I'm getting Error in FUN(newX[, i], ...) : argument "y" is missing, with no default and I don't know what I'm supposed to do about the missing "y".
install.packages("Rmpfr")
install.packages("DEoptim")
library(gmp)
library(Rmpfr)
library(parallel) # https://cran.r-project.org/web/packages/DEoptim/vignettes/DEoptim.pdf
library(DEoptim)
z = function(x,y) {
  (exp(sin(60.0*x)) + sin(50.0*exp(y)) + sin(80.0*sin(x)) + sin(sin(70.0*y)) - sin(10.0*(x+y)) + (x*x+y*y)/4.0)
}
optimized_Minimum <- DEoptim(z, lower = c(-1,-1), upper = c(1,1),
                             control=list(storepopfrom=1, trace=FALSE))
# optimized_Minimum <- optim(z, lower = c(-1,-1), upper = c(1,1), method = "Brent")
DEoptim does not expect you to pass 2 separate arguments (x and y) to your function, but you can still solve for multiple variables.
As with the optim function, you need to write your objective to take a single parameter vector rather than 2 separate arguments.
I tested this with the functions from the linked solution and it worked:
## parametros and xm come from the linked solution
fxcalc <- function(s,t){ (1-(1-(parametros$ap/xm)^(s))^t)*100 }
suma <- function(s,t){ (parametros$fx-fxcalc(s,t))^2 }
func <- function(st){           # DEoptim optimizes over this single vector
  s <- st[1]
  t <- st[2]
  sum(suma(s,t))
}
optimized_Minimum <- DEoptim(func, lower = c(-1,-1), upper = c(1,1),
                             control=list(storepopfrom=1, trace=FALSE))
summary(optimized_Minimum)
***** summary of DEoptim object *****
best member : 1 1
best value : 0
after : 200 generations
fn evaluated : 402 times
*************************************
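Applied to the z from the question, a sketch of the same idea (my own code, using the standard DEoptim result fields) would look like this: wrap z so it takes one parameter vector and unpacks x and y inside.
z_vec <- function(par) {
  x <- par[1]
  y <- par[2]
  exp(sin(60*x)) + sin(50*exp(y)) + sin(80*sin(x)) +
    sin(sin(70*y)) - sin(10*(x + y)) + (x*x + y*y)/4
}
set.seed(1)
opt <- DEoptim(z_vec, lower = c(-1,-1), upper = c(1,1),
               control = list(trace = FALSE))
opt$optim$bestmem   # x and y at the best point found
opt$optim$bestval   # corresponding value of z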
fun1 = function(y, mu=mu0, lsig=lsig0) {
  res = 1/(exp(-y)+1)^2 * 1/sqrt(2*pi)/exp(lsig) * exp(-(y-mu)^2/2/exp(lsig)^2)
  return(res)
}
fun4 = function(para=c(mu1,lsig1)) {
  mu1 = para[1]
  lsig1 = para[2]
  res = n1 * log(noze(integrate(fun1,-Inf,Inf,mu=mu1,lsig=lsig1)$value)) +
        n3 * log(noze(integrate(fun2,-Inf,Inf,mu=mu1,lsig=lsig1)$value)) +
        n2 * log(noze(integrate(fun3,-Inf,Inf,mu=mu1,lsig=lsig1)$value))
  return(-res)
}
noze = function(x) {
  if (x < 1e-100) { x = 1e-100 }
  return(x)
}
optim(c(0.5,2), fun4, method="L-BFGS-B", lower=c(-5,-3), upper=c(3.5,3.5))$par
I have to find the two parameters of 'fun4', which uses the integral of 'fun1'. ('fun2' and 'fun3' are slightly different from 'fun1'.)
I encountered the error
Error in integrate(fun1, -Inf, Inf, -3.9538, -3) :
  the integral is probably divergent
Using a scatterplot, I found that fun1 is close to zero almost everywhere except on (-4.2, -3.7).
Thus, integrating over that interval only gives the (approximately) correct integral.
> integrate(fun1,-4.2,-3.6,-3.9538,-3)
0.0003558953 with absolute error < 3e-11
This can be confirmed using nearby parameter values
> integrate(fun1,-Inf,Inf,-3.9538,-3.1)
0.0003555906 with absolute error < 2.6e-05
> integrate(fun1,-Inf,Inf,-3.9538,-2.85)
0.0003564842 with absolute error < 3.7e-06
If the interval is too wide, it gives an incorrect integral.
> integrate(fun1,-5,5,-3.9538,-3)
0.0003558953 with absolute error < 2.3e-08
> integrate(fun1,-15,15,-3.9538,-3)
3.492547e-11 with absolute error < 6.5e-11
> integrate(fun1,-30,30,-3.9538,-3)
1.980146e-41 with absolute error < 3.4e-41
> integrate(fun1,-50,50,-3.9538,-3)
0 with absolute error < 0
> integrate(fun1,-Inf,Inf,-3.9538,-3)
Error in integrate(fun1, -Inf, Inf, -3.9538, -3) :
the integral is probably divergent
If I only had to integrate once, I could find an interval where 'fun1' is large enough and integrate over that interval only.
But the problem is that I use optim, which tries various parameter values to find a minimizer of 'fun4'.
Using (-Inf, Inf) gives an error, and a wide fixed interval gives incorrect integrals.
Is there a good method to solve this?
Convolution with a Gaussian kernel can be handled with Gauss-Hermite integration, and there is an R package for that: https://cran.r-project.org/web/packages/gaussquad/gaussquad.pdf
Some test code:
library(gaussquad)
n.quad <- 128 # integration order
# get the particular weights/abscissas as data frame with 2 observables and n.quad observations
rule <- ghermite.h.quadrature.rules(n.quad, mu = 0.0)[[n.quad]]
# test function: integrate f(x) = 1 against the weight exp(-x^2) from -Inf to Inf;
# should get sqrt(pi) as the answer
f <- function(x) {
  1.0
}
q <- ghermite.h.quadrature(f, rule)
print(q - sqrt(pi))
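And here is a sketch of my own (building on the rule defined above) showing how the same quadrature applies to fun1 from the question: fun1 is a logistic-type factor times a Normal(mu, exp(lsig)) density, so after the substitution y = mu + sqrt(2)*sigma*t the Gaussian part becomes the exp(-t^2) weight and a constant 1/sqrt(pi), and no infinite-interval call to integrate is needed.
mu0 <- -3.9538; lsig0 <- -3
sigma <- exp(lsig0)
g <- function(t) 1 / (exp(-(mu0 + sqrt(2)*sigma*t)) + 1)^2 / sqrt(pi)
ghermite.h.quadrature(g, rule)
# compare with the finite-interval value from the question:
# integrate(fun1, -4.2, -3.6, -3.9538, -3)   # 0.0003558953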