Manually calculate the two-sample Kolmogorov-Smirnov statistic using ECDF - r

I am trying to manually compute the KS statistic for two random samples. As far as I understand, the KS statistic D is the maximum vertical deviation between the two CDFs. However, manually calculating the differences between the two CDFs and running ks.test from base R yield different results. I wonder where the mistake is.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
### Manual calculation
# function for calculating manually the ecdf
decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)
#Difference between the two CDFs
d <- curve(decdf(x,a,b), from=min(a,b), to=max(a,b))
# getting D
ks <- max(abs(d$y))
#### R-Base calculation
ks.test(a,b)
The base R result is D = 0.0109, while the manual calculation gives 0.0088. Any help explaining the difference is appreciated.
I attach the base R source code (a bit cleaned up):
n <- length(a)
n.x <- as.double(n)
n.y <- length(b)
n <- n.x * n.y/(n.x + n.y)
w <- c(a, b)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
STATISTIC <- max(abs(z))

By default, curve evaluates the function on a grid of only 101 equally spaced points between from and to (its argument n defaults to 101). Restricted to this grid, it can easily miss the point at which the maximum difference is attained.
Instead, evaluate the difference at every point where either ECDF jumps, that is, at the pooled sample values. The difference is a step function that only changes at these points, so the maximum is guaranteed to be among them.
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
Fa <- ecdf(a)
Fb <- ecdf(b)
x <- c(a,b) # the points where Fa or Fb jump
max(abs(Fa(x) - Fb(x)))
# [1] 0.0109
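The effect of the grid can be checked directly. The following is a minimal sketch, assuming that the grid used by curve is reproduced by seq with length.out = 101; the names grid, d_grid, d_exact and d_ks are introduced here for illustration only.

```r
set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)
Fa <- ecdf(a)
Fb <- ecdf(b)

# Coarse grid, assumed to mimic what curve() evaluates by default (n = 101)
grid <- seq(min(a, b), max(a, b), length.out = 101)
d_grid <- max(abs(Fa(grid) - Fb(grid)))

# All jump points of either ECDF: the pooled sample
x <- sort(c(a, b))
d_exact <- max(abs(Fa(x) - Fb(x)))

# Reference value from base R
d_ks <- unname(ks.test(a, b)$statistic)

c(grid = d_grid, exact = d_exact, ks.test = d_ks)
```

The grid value can never exceed the exact one, because every grid point takes the same value as the nearest jump point to its left.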


R: probability / numerical integral of bivariate (or multivariate) kernel density

I am using the package ks for kernel density estimation. Here's an easy example:
n <- 70
x <- rnorm(n)
library(ks)
f_kde <- kde(x)
I am actually interested in the respective exceedance probabilities of my input data, which ks can easily return given f_kde:
p_kde <- pkde(x, f_kde)
This is done in ks by numerical integration using Simpson's rule. Unfortunately, this is only implemented for the 1d case. In the bivariate case, ks has no method for returning the probabilities:
y <- rnorm(n)
f_kde <- kde(data.frame(x,y))
# does not work, but it's what I am looking for:
p_kde <- pkde(data.frame(x,y), f_kde)
I couldn't find any package or help on Stack Overflow to solve this issue in R (some suggestions for Python exist, but I would like to keep it in R). Any line of code or package recommendation is appreciated. Even though I am mostly interested in the bivariate case, any ideas for the multivariate case are appreciated as well.
kde supports multidimensional kernel estimates, so we can use kde itself to compute the equivalent of pkde.
To do this, we evaluate kde on a grid with small enough steps dx and dy using the eval.points parameter: this gives the local density estimate on each dx*dy square.
We verify that the sum of the estimates multiplied by the area of the squares is almost equal to 1:
library(ks)
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
xy <- cbind(x,y)
xmin <- -10
xmax <- 10
dx <- .1
ymin <- -10
ymax <- 10
dy <- .1
pts.x <- seq(xmin, xmax, dx)
pts.y <- seq(ymin, ymax, dy)
pts <- as.data.frame(expand.grid(x = pts.x, y = pts.y))
f_kde <- kde(xy,eval.points=pts)
pts$est <- f_kde$estimate
sum(pts$est)*dx*dy
# [1] 0.9998778
You can now query the pts data frame for the cumulative probability over the area of your choice:
library(data.table)
setDT(pts)
# cumulative density
pts[x < 1 & y < 2 , .(pkde=sum(est)*dx*dy)]
#         pkde
# 1: 0.7951228
# average density around a point
tolerance <-.1
pts[abs(x-1)<tolerance & abs(y-2)<tolerance, .(kde = mean(est))] # pmin() with a single argument was a no-op
#           kde
# 1: 0.01465478
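The accuracy of this grid sum depends on the cell size. As a rough illustration in base R only, the sketch below replaces the kde estimate with the exact standard bivariate normal density (an assumption made here so that the true probability is known) and compares the grid sum with the exact value. The names grid, total, p_grid and p_true are hypothetical.

```r
dx <- 0.1
dy <- 0.1
grid <- expand.grid(x = seq(-10, 10, dx), y = seq(-10, 10, dy))

# Stand-in for f_kde$estimate: exact density of two independent N(0, 1) variables
grid$est <- dnorm(grid$x) * dnorm(grid$y)

# Total mass on the grid, should be close to 1
total <- sum(grid$est) * dx * dy

# Grid approximation of P(X < 1, Y < 2) and the exact value
p_grid <- sum(grid$est[grid$x < 1 & grid$y < 2]) * dx * dy
p_true <- pnorm(1) * pnorm(2)

c(total = total, grid = p_grid, exact = p_true)
```

Because each grid point stands for a whole dx*dy cell, the cut at x < 1 and y < 2 can be off by up to about one cell width in each direction, so smaller steps give a closer match at the cost of more evaluation points.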

Why manual autocorrelation does not match acf() results?

I'm trying to understand acf and pacf, but I do not understand why the acf() result does not match a simple cor() at lag 1.
I have simulated a time series:
set.seed(100)
ar_sim <- arima.sim(list(order = c(1,0,0), ar = 0.4), n = 100)
ar_sim_t <- ar_sim[1:99]
ar_sim_t1 <- ar_sim[2:100]
cor(ar_sim_t, ar_sim_t1) ## 0.1438489
acf(ar_sim)[[1]][2] ## 0.1432205
Could you please explain why the first lag correlation in acf() does not exactly match the manual cor() between the series and its lag-1 copy?
The standard estimate of the autocorrelation of a discrete process with mean mu and variance s^2 is the following, where both are computed from the whole series (cor() instead centers and scales each of the two shortened sub-series separately). See, for instance, Wikipedia.
n <- length(ar_sim)
l <- 1
mu <- mean(ar_sim)
s <- sd(ar_sim)
sum((ar_sim_t - mu)*(ar_sim_t1 - mu))/((n - l)*s^2)
#[1] 0.1432205
This value is not bit-for-bit identical to the one computed by the built-in stats::acf, but the two agree up to floating-point rounding:
a.stats <- acf(ar_sim)[[1]][2]
a.manual <- sum((ar_sim_t - mu)*(ar_sim_t1 - mu))/((n - l)*sd(ar_sim)^2)
all.equal(a.stats, a.manual) # TRUE
identical(a.stats, a.manual) # FALSE
a.stats - a.manual
#[1] 1.110223e-16
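For lags other than 1, a sketch of the estimator that stats::acf uses by default (type = "correlation", demean = TRUE) is the lagged cross-product sum divided by the full sum of squares, both centered on the overall mean. The helper name manual_acf is hypothetical and introduced only for this illustration.

```r
set.seed(100)
ar_sim <- arima.sim(list(order = c(1, 0, 0), ar = 0.4), n = 100)

# Lag-k sample autocorrelation: centered lagged products over the total sum of squares
manual_acf <- function(x, lag) {
  n <- length(x)
  mu <- mean(x)
  num <- sum((x[1:(n - lag)] - mu) * (x[(1 + lag):n] - mu))
  den <- sum((x - mu)^2)
  num / den
}

r_manual <- sapply(1:5, manual_acf, x = ar_sim)
r_acf <- acf(ar_sim, lag.max = 5, plot = FALSE)$acf[2:6]  # element 1 is lag 0
all.equal(r_manual, r_acf)
```

At lag 1 the denominator (n - l) * s^2 used above equals this total sum of squares, since sd() divides by n - 1; for larger lags the two denominators differ.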

How to compute a correlation coefficient in a for loop repeated 5000 times and save the statistic?

Two independent, normally distributed variables x and y are generated using x = rnorm(50) and y = rnorm(50). Calculate the correlation 5000 times and save the result each time. What is the likelihood that a correlation with absolute value greater than 0.3 is computed? (Use set.seed(42) and plot a histogram of the spread of the coefficients.)
This is what I have tried so far:
set.seed(42)
n <- 50 #length of random sequence
x_norm <- rnorm(n)
y_norm <- rnorm(n)
nrun <- 5000
corr <- numeric(nrun)
for (i in 1:nrun) {
corrxy <- cor(x_norm,y_norm)
corr[i] <- sum(abs(corrxy > 0.3)) / n #save statistic in the vector
}
hist(corr)
I expect to get 5000 different coefficients saved in corr[i], and when plotted using hist(), these coefficients should follow approximately a normal distribution. But I do not understand how the for loop works or how to incorporate the condition that the coefficient is greater than 0.3.
I think you were nearly there. You just had to shift some code outside and inside the for loop.
You want new data for each run of the loop (otherwise you get the same correlation 5000 times) and you need to save the correlation each time the loop runs. This results in a vector of 5000 correlations which you can use to look at the proportion of correlations (divide by the number of runs, not the number of observations) that are higher than .3 outside of the for loop.
Edit: One final correction is needed in the bracketing of the absolute value function. You want abs(corrxy) > .3 (absolute correlations above .3), not abs(corrxy > .3) (the absolute value of a logical comparison).
set.seed(42)
n <- 50 #length of random sequence
nrun <- 5000
corrxy <- numeric(nrun) # The correlation is the statistic you want to save
for (i in 1:nrun) {
  x_norm <- rnorm(n) # Compute a new dataset for each run (otherwise you get the same correlation)
  y_norm <- rnorm(n)
  corrxy[i] <- cor(x_norm,y_norm) # Calculate the correlation
}
hist(corrxy)
sum(abs(corrxy) > 0.3) / nrun # look at the proportion of runs that have |cor| > .3
hist(corrxy) plots the histogram of the 5000 correlations. The proportion of correlations whose absolute value exceeds .3 is 0.034 in this case.
Here's another way of doing this kind of simulation without explicitly writing a loop.
First define your simulation:
my_sim <- function(n) { # n is the sample size of each normal draw
  x <- rnorm(n)
  y <- rnorm(n)
  corrxy <- cor(x, y)
  corrxy # return the correlation (single value)
}
Now we can call this function many times with replicate():
set.seed(123)
nrun <- 10
my_results <- replicate(nrun, my_sim(n=50))
#my_results
# [1] -0.0358698314 -0.0077403045 -0.0512509071 -0.0998484901 0.1230261286 0.1001124010 -0.0002023124
# [8] 0.2017120443 0.0644662387 0.0567232640
Now my_results holds the correlations from all simulations (just 10 in this example).
And you can compute your statistics:
sum(abs(my_results)> 0.3) / nrun # nrun is 10
or plot:
hist(my_results)
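The simulated proportion can also be compared with a theoretical value. The sketch below assumes x and y are independent and normal, in which case r * sqrt(n - 2) / sqrt(1 - r^2) follows a t distribution with n - 2 degrees of freedom. The names r0, t0, p_theory, sims and p_sim are introduced here for illustration.

```r
n <- 50
r0 <- 0.3

# Two-sided tail probability of |r| > r0 from the t distribution with n - 2 df
t0 <- r0 * sqrt(n - 2) / sqrt(1 - r0^2)
p_theory <- 2 * pt(-t0, df = n - 2)

# Simulated proportion, same design as above
set.seed(42)
sims <- replicate(5000, cor(rnorm(n), rnorm(n)))
p_sim <- mean(abs(sims) > r0)

c(theory = p_theory, simulated = p_sim)
```

With 5000 runs the simulated proportion should land within a few tenths of a percentage point of the theoretical one.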

Defining a threshold for results in lsoda, R language

I have a problem with lsoda in the deSolve package in R (it might apply to the ode function too). I am modeling the dynamics of a food web using a set of ODEs that calculate the abundances of 5 species in two identical food webs connected through dispersal.
The abundances are calculated over 2000 time steps, and they are not supposed to be negative or less than 1e-6; in that case the result should be set to 0. I could not find any lsoda parameter that turns negative results into zero. I tried the following trick in my ODE function:
solve.model <- function(t,y, parms){
  y <- ifelse(y<1e-6, 0, y)
  #ODE functions here
  #...
  #...
  return(list(dy))
}
but it does not seem to work. Below is a sample of the abundances of species in a web.
I would be very grateful for your help, and I hope the sample code gives enough information about my problem.
Babak
P.S. I am solving the following set of ODEs for the abundances of species (the first two equations) and the resource change (third equation).
The corresponding code for the function is below:
solve.model <- function(t, y, parms){
  y <- ifelse(y<1e-6, 0, y)
  with(parms,{
    # return from vector form into matrix form for calculations
    (R <- as.matrix(y[(max(no.species)*length(no.species)+1):length(y)]))
    (N <- matrix(y[1:(max(no.species)*length(no.species))], ncol=length(no.species)))
    dy1 <- matrix(nrow=max(no.species), ncol=length(no.species))
    dy2 <- matrix(nrow=length(no.species), ncol=1)
    for (i in 1:no.webs){
      species <- no.species[i]
      (abundance <- N[1:species,i])
      adj <- as.matrix(webs[[i]])
      a.temp <- a[1:species, 1:species]*adj
      b.temp <- b[1:species, 1:species]*adj
      h.temp <- h[1:species, 1:species]*adj
      # Calculating sigmas in denominator of Holling type II functional response
      (sum.over.preys <- abundance%*%(a.temp*h.temp))
      (sum.over.predators <- (a.temp*h.temp)%*%abundance)
      # Calculating growth of basal species
      (basal.growth <- basals[,i]*N[,i]*(mu*R[i]/(K+R[i])-m))
      # Calculating growth for non-basal species
      no.basal <- rep(1,len=species)-basals[1:species]
      predator.growth <- rep(0, max(no.species))
      (predator.growth[1:species] <- ((abundance%*%(a.temp*b.temp))/(1+sum.over.preys)-m*no.basal)*abundance)
      predation <- rep(0, max(no.species))
      (predation[1:species] <- (((a.temp*b.temp)%*%abundance)/t(1+sum.over.preys))*abundance)
      (pop <- basal.growth + predator.growth - predation)
      dy1[,i] <- pop
      dy2[i] <- 0.0005 # Change in the resource
    }
    # Calculating dispersals. They can easily be replaced
    # by adjacency maps of connections between food webs arbitrarily!
    # added to solve the problem of negative abundances
    deltas <- append(c(dy1), dy2)
    return(list(deltas))
  })
}
This function is used by lsoda through the following call:
temp.abund[[j]] <- lsoda(y=initials, func=solve.model, times=0:max.time, parms=parms)
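A note on why the trick may have no effect: the y passed to the derivative function is a copy, so overwriting it there changes only the derivatives computed in that call, not the state the solver carries forward. One possible alternative, sketched below under the assumption that deSolve's events mechanism suits this model, is to reset small states through an event function at every output time. The toy decay model and the names decay and clamp are hypothetical and stand in for the food-web model.

```r
# Toy model (assumption): single state N decaying exponentially
decay <- function(t, y, parms) list(-parms$rate * y)

# Event function: returns the full state vector, with small values set to 0
clamp <- function(t, y, parms) ifelse(y < parms$threshold, 0, y)

times <- 0:50
parms <- list(rate = 0.5, threshold = 1e-6)

if (requireNamespace("deSolve", quietly = TRUE)) {
  out <- deSolve::lsoda(y = c(N = 1), times = times, func = decay,
                        parms = parms,
                        events = list(func = clamp, time = times))
  print(tail(out))
}
```

Unlike the in-function ifelse, the value returned by the event function replaces the solver's state, so once N is clamped to 0 it stays there in this toy model.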

R optimize not giving the finite minimum but Inf when the search interval is wider

I have a problem with optimize().
When I limit the search in a small interval around zero, e.g., (-1, 1), the optimize algorithm gives a finite minimum with a finite objective function value.
But when I make the interval wider to (-10, 10), then the minimum is on the boundary of the interval and the objective is Inf, which is really puzzling for me.
How can this happen and how to fix this? Thanks a lot in advance.
The following is my code.
set.seed(123)
n <- 120
c <- rnorm(n,mean=1,sd=.3);
eps <- rnorm(n,mean=0,sd=5)
tet <- 32
r <- eps * c^tet
x <- matrix(c(c,r), ncol=2)
g <- function(tet, x){
  matrix((x[,1]^(-tet))*x[,2],ncol=1)
}
theta <- 37
g_t <- g(theta,x)
f.tau <- function(tau){
  exp.tau.g <- exp(g_t %*% tau)
  g.exp <- NULL; i <- 1:n
  g.exp <- matrix(exp.tau.g[i,] * g_t[i,], ncol=1)
  sum.g.exp <- apply(g.exp,2,sum)
  v <- t(sum.g.exp) %*% sum.g.exp
  return(v)
}
band.tau <- 1;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-1, 1)"); print(f);
band.tau <- 10;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-10, 10)"); print(f);
The problem is that your function f.tau(x) is not well behaved. You can see that here:
vect.f <- Vectorize(f.tau)
z1 <- seq(-1,1,by=0.01)
z10 <- seq(-10,10,by=0.01)
par(mfrow=c(2,1), mar=c(2,2,1,1))
plot(z1, log(vect.f(z1)), type="l")
plot(z10,log(vect.f(z10)),type="l")
Note that these are plots of log(f.tau). So there are two problems: f.tau(...) has an extremely large slope on either side of the minimum, and f.tau = Inf for x < -0.6 and x > 1.0, where Inf means that f.tau(...) is greater than the largest number that can be represented on this system. When you set the range to (-1,1), your starting point is close enough to the minimum that optimize(...) manages to converge. When you set the limits to (-10,10), the starting point is too far away. There are examples in the documentation which show a similar problem with functions that are not nearly as ill-behaved as f.tau.
EDIT (Response to OP's comment)
The main problem is that you are trying to optimize a function which has computational infinities in the interval of interest. Here's a way around that.
band.tau <- 10
z <- seq(-band.tau,band.tau,length=1000)
vect.f <- Vectorize(f.tau)
interval <- range(z[is.finite(vect.f(z))])
f <- optimize(f.tau, interval, tol=1e-20)
f
# $minimum
# [1] 0.001615433
#
# $objective
# [,1]
# [1,] 7.157212e-12
This evaluates f.tau(x) at 1000 equally spaced points on (-band.tau,+band.tau), identifies all the values of x where f.tau is finite, and uses their range as the search interval in optimize(...). This works in your case because f.tau(x) does not (appear to...) have asymptotes.
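The same idea can be wrapped in a small helper and tried on a simpler function. This is only a sketch: finite_interval and toy are hypothetical names, toy is an artificial function that overflows to Inf away from its minimum at 0, and the approach assumes that the finite region is one contiguous interval.

```r
# Range of grid points in [lower, upper] where f is finite (f must be vectorized)
finite_interval <- function(f, lower, upper, n = 1000) {
  z <- seq(lower, upper, length.out = n)
  range(z[is.finite(f(z))])
}

# Artificial example: minimum at 0, overflows to Inf for |x| above roughly 0.35
toy <- function(x) (exp(1000 * x) - exp(-1000 * x))^2

# Searching the wide interval directly: optimize() only ever sees non-finite values
wide <- suppressWarnings(optimize(toy, c(-10, 10)))

# Searching only where toy is finite
interval <- finite_interval(toy, -10, 10)
res <- optimize(toy, interval)

interval
res$minimum
```

If the function were finite on several disjoint stretches, range() would also span the non-finite gaps between them, so the finite points should be inspected before relying on this.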
