Fast and accurate computation of studentized external residuals in R

I want to compute the external studentized residuals of a dataset {x,y} of size n in R given the following constraints:
(very) high precision
high performance (avoiding loops where possible)
R language (including RCPP)
The R code should be fast because it will be used extensively (10^9 times minimum) on multiple data sets with n in [10^3, 10^6]. This question is part of a larger work for estimating a custom statistic that requires the studentized residuals. The most computationally expensive part is the one presented here; solving it would dramatically improve the overall efficiency.
On the lm() regression
To gather the studentized external residuals, one typically runs lm() then rstudent(). The R function uses an approach that avoids running n regressions to estimate the studentized residuals, which saves a lot of execution time. However, I prefer not to use lm() because I only need the residuals, without all the fancy additional stuff that comes with it (thus saving some more execution time).
When trying to decipher the R source code for the external residuals in lm(), I found it somewhat obscure, as it seems to call code from other external files (an example is the influence() function). Thus, at this time, I have failed to collect enough information to replicate that code section from the source alone.
Relevant topic(s) on Stack
The following relevant topic has been found on Stack Overflow: How to compute Studentized Residuals in Python?
An R implementation of the Python procedure, including a minimal example, is given below (corrected by @Stéphane Laurent, see answers):
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
m = 2
mean_y = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_y) %*% (y - mean_y)
beta_1 = ((y - mean_y) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_y
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_y) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
reg = rstudent(lm(x ~ y))
res = cbind(reg, studentized_residuals)
This produces the following differences:
index reg studentized_residuals
1 -0.595911898846465 -0.581348373714385
2 0.116208945967327 0.116097011762269
3 -2.04779452591111 -1.61939642040734
4 2.26350621688535 1.71995630000724
5 0.603322309518977 0.588222428131761
6 -1.5460639774285 -1.33486217871738
7 0.367900050364855 0.364393996552621
8 1.14745971090533 1.05271762293388
9 0.823888320713653 0.786630743176311
10 -0.449839343257121 -0.443475039943641
Minimal example
The following R attempt has been tested using arbitrary datasets, just for illustration purposes.
It uses lm() / rstudent() and is far too slow for our practical application. The two parameters n1 and n2 correspond to the number of iterations and the size of the vector (denoted n above), respectively. To match our problem, we would typically pick n1 in [10^6, 10^9] and n2 in [10^3, 10^6]:
Stud = function(n1, n2){
  res = data.frame(matrix(vector(), n2, n1))
  for(i in 1 : n1){
    x = rnorm(n2)
    y = rnorm(n2)
    reg = lm(x ~ y)
    res[, i] = rstudent(reg)
  }
  res
}
Update and additional (full) minimal example for benchmark:
Here we show a complete benchmark where various functions found on Stack Overflow are tested against lm() with the goal of gathering the externally studentized residuals. To gather these residuals, we need to run n regressions. Results are given after the code for 100 and 500 replications.
#Packages
install.packages("Rcpp")
library(Rcpp)
install.packages("RcppArmadillo")
library(RcppArmadillo)
install.packages("RcppEigen")
library(RcppEigen)
install.packages("stats")
library(stats)
install.packages("speedglm")
library(speedglm)
install.packages("Rfast")
library(Rfast)
install.packages("rbenchmark")
library(rbenchmark)
## start from SEXP, most conversions, longest code
src <- '
Rcpp::List fLmSEXP(SEXP Xs, SEXP ys) {
Rcpp::NumericMatrix Xr(Xs);
Rcpp::NumericVector yr(ys);
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Rcpp types as early RcppArmadillo examples did
src <- '
Rcpp::List fLmTwoCasts(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types
src <- '
Rcpp::List fLmOneCast(arma::mat X, arma::colvec y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types passed as constant references
src <- '
Rcpp::List fLmConstRef(const arma::mat & X, const arma::colvec & y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
#Benchmark
data = benchmark("OneCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmOneCast(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"TwoCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmTwoCasts(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Const" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmConstRef(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Sexp" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmSEXP(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fastLm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Speed" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- speedlm(x[-i] ~ y[-i], fitted = T)
sum((x[-i] - fit$fitted.values)^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
".Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lmfit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Lm" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Basic" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
reg <- lm(x ~ y)
reg_stud <- rstudent(reg)
},
replications = 500,
columns = c("test", "elapsed", "replications"))
Results:
On this single benchmark, the rstudent(lm()) is much faster than everything else:
test elapsed replications
7 .Fit 13.84 100
10 Basic 0.25 100
3 Const 7.37 100
5 Fast 99.84 100
8 Fit 7.06 100
9 Lm 105.25 100
1 OneCast 7.61 100
4 Sexp 7.66 100
6 Speed 184.76 100
2 TwoCast 7.17 100
7 .Fit 63.63 500
10 Basic 0.93 500
3 Const 34.44 500
5 Fast 438.95 500
8 Fit 31.11 500
9 Lm 471.37 500
1 OneCast 34.29 500
4 Sexp 33.48 500
6 Speed 794.73 500
2 TwoCast 33.51 500
Interpretation
It seems that R uses an analytical alternative that avoids running n regressions, resulting in a much faster computation; a sketch of that identity is given below.
Thus, the question remains: how can we be competitive with rstudent(lm()), and how can we reverse-engineer the original source code (which is hard to follow)?
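For reference, a minimal sketch of that analytic identity, using only base stats functions (it assumes the simple regression x ~ y with an intercept and no exact-fit observations; the variable names are mine):
n   <- length(x)
p   <- 2                                  # number of coefficients: intercept + slope
fit <- lm.fit(cbind(1, y), x)             # a single fit of x ~ y
r   <- fit$residuals
h   <- hat(y)                             # leverages h_ii (hat() adds the intercept)
s2  <- sum(r^2) / (n - p)                 # overall residual variance
rstand <- r / sqrt(s2 * (1 - h))          # internally studentized residuals
rstud  <- rstand * sqrt((n - p - 1) / (n - p - rstand^2))  # externally studentized
Up to floating-point error, rstud matches rstudent(lm(x ~ y)) without a single leave-one-out refit, which is why the "Basic" entry above is so hard to beat.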
Final results
We compared the solutions of @Onyambu, @tester and @Stéphane Laurent. We found @Onyambu's solution to be the fastest for the various vector sizes, while producing results exactly equal to those of rstudent().

One gets the same results by replacing your var_e with
var_e = vapply(1:n, function(i){
sigma(lm(x[-i] ~ y[-i]))
}, numeric(1))
To get that efficiently, do not use lm but lm.fit:
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))

EDIT:
This edit indicates that a faster rstudent function than the one previously given was found:
fast_rstudent <- function(X, y, intercept = TRUE, tol = 1e-7){
  # least-squares fit of y on cbind(intercept, X) via the internal QR routine
  mqr <- .Call(stats:::C_Cdqrls, cbind(intercept, X), y, tol, FALSE)
  # leave-one-out sigma and hat values via the internal influence routine
  res <- .Call(stats:::C_influence, mqr, mqr$residuals, 1e-12)
  mqr$residuals/(res$sigma*sqrt(1-res$hat))
}
So far this function is very fast.
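As a hedged sanity check (it assumes fast_rstudent above runs on your R version; the internal C entry points are not a stable API), one can compare it against the built-in:
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
# fast_rstudent(y, x) fits x ~ y, so it should match rstudent(lm(x ~ y))
all.equal(unname(fast_rstudent(y, x)), unname(rstudent(lm(x ~ y))))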
Previous Answer
Since you are using R, you could use a QR decomposition to solve this. Your aim is to write an rstudent function that is faster than the built-in one by getting rid of the overhead of function calls and checks. That means you should only use the necessary internal functions. Below is a quick way to do this:
my_rstudent <- function (X, y, intercept = TRUE) {
X <- cbind(intercept, X)
u <- .Call(stats:::C_Cdqrls, X, y, 1e-7, FALSE)
d <- dim(X)
n <- as.integer(d[1L])
k <- as.integer(d[2L])
df_res <- n - k
z <- .Internal(diag(1, n, k))
v <- .Fortran(.F_dqrqy, as.double(u$qr), n, k, as.double(u$qraux),
z, k, qy = z)$qy
h_ii <-.Internal(rowSums(v^2, n, k, FALSE))
rstand <- u$residuals/sqrt(sum(u$residuals**2)/df_res)/sqrt(1-h_ii)
rstand * sqrt((df_res - 1)/( df_res - rstand^2))
}
In a way, this function misuses R by almost entirely stripping out the overhead functions. It assumes that what is passed to it is correct.
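If you want a cheap safeguard in front of it, a minimal input check (my addition, not part of the original answer) could be:
# minimal guard: my_rstudent() itself performs none of lm()'s validity checks
check_xy <- function(X, y) {
  stopifnot(is.numeric(X), is.numeric(y),
            NROW(X) == NROW(y),
            !anyNA(X), !anyNA(y))
}
# e.g. check_xy(x, y) before calling my_rstudent(x, y)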
Results:
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
cbind(mine=my_rstudent(x, y), from_R=rstudent(lm(y~x)))
mine from_R
1 0.92113157 0.92113157
2 0.15753536 0.15753536
3 -1.69587949 -1.69587949
4 -3.59182456 -3.59182456
5 0.98274664 0.98274664
6 -0.85765961 -0.85765961
7 -0.07768369 -0.07768369
8 1.05874766 1.05874766
9 0.80181623 0.80181623
10 0.11418833 0.11418833
benchmark:
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
rstudent(lm(y ~ x)) 45.36667 37.20755 26.89753 24.29545 22.39587 11.31733 100
With a small dataset, the overhead functions noticeably slow down the computation of rstudent.
Relatively large dataset:
n = 1000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
rstudent(lm(y ~ x)) 8.530228 8.059269 7.700426 7.848123 7.616909 3.877305 100
Huge dataset:
n = 1000000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10
rstudent(lm(y ~ x)) 1.510198 1.560989 1.486083 1.666609 1.603455 1.01154 10
Very large dataset:
n = 10000000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10
rstudent(lm(y ~ x)) 1.603652 1.603881 1.534455 1.58802 1.560724 1.305315 10
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)), times = 10)
Unit: seconds
expr min lq mean median uq max neval
my_rstudent(x, y) 1.584408 1.619822 1.727310 1.658917 1.757311 2.213203 10
rstudent(lm(y ~ x)) 2.458445 2.619609 2.705212 2.696705 2.776588 2.949799 10

I think the solution to your problem will be to drop all unnecessary overhead from the functions first; if that is not fast enough, try converting the code to C++ and running it with Rcpp. It is very likely that you'll be able to improve on my results if you compute the residuals from .lm.fit with your own implementation, instead of using lm.fit as I did.
I also checked whether there is a difference in the studentized residuals depending on the function used (lm, lm.fit, .lm.fit), and it turns out that this is the case. However, the residuals from my function here are equal to those produced by MASS::studres for a regression of y ~ x with x having only one column.
Here's my code and a benchmark versus the fastest version from above called "Basic":
library(rbenchmark)
library(microbenchmark)
library(MASS)
set.seed(1)
x <- matrix(rnorm(500), ncol = 1)
y <- matrix(rnorm(500), ncol = 1)
myFunc <- function(x, y, n = 500){
# tmp <- .lm.fit(x, y) # linear model fit
object <- lm.fit(x = x, y = y)
resid <- object$residuals
hat <- lm.influence(object, do.coef = FALSE)$hat
# hat <- hat[hat > 0] # remove checks
# ok <- !(is.na(resid)) # remove checks
# n.miss <- sum(!ok) # remove checks
# resid <- resid[ok] # remove checks
# n <- length(resid)
# p <- object$rank # equal to one
p <- 1
rdf <- n - 1
studres <- resid
stddev <- sqrt(sum(resid^2)/rdf)
sr <- resid/(sqrt(1 - hat) * stddev)
stdres <- sr
studres <- sr/sqrt((n - p - sr^2)/(n - p - 1))
studres <- naresid(object$na.action, studres)
return(studres)
}
test1 <- stats::rstudent(lm(x ~ y)) # rstudent doesn't work with lm.fit
test2 <- MASS::studres(lm(x ~ y))
test3 <- MASS::studres(lm.fit(x, y))
test4 <- myFunc(x, y, n = 500)
> head(cbind(test1, test2, test3, test4))
test1 test2 test3 test4
1 -0.6368094 -0.6368094 0.04696790 0.04696790
2 0.1493050 0.1493050 -0.27286396 -0.27286396
3 -0.8941217 -0.8941217 -1.15505676 -1.15505676
4 1.5598965 1.5598965 0.07729179 0.07729179
5 0.3440252 0.3440252 0.95155123 0.95155123
6 -0.7714317 -0.7714317 1.47600416 1.47600416
####################################
mbm <- microbenchmark("lm" = {rstudent(lm(y~x)) },
"MASS_lm" = {
MASS::studres(lm(y~x))
},
"MASS_lm.fit" = {
MASS::studres(lm.fit(x = x , y = y))
},
"myFunc" = {myFunc(x, y, n = 500)},
times = 100
)
> mbm
Unit: microseconds
expr min lq mean median uq max neval
lm 767.001 869.1510 1188.023 977.1505 1185.5010 8279.801 100
MASS_lm 704.601 909.2000 1085.261 997.3515 1168.8505 2052.202 100
MASS_lm.fit 168.001 195.0510 282.166 212.9510 254.1015 2912.201 100
myFunc 147.901 168.8015 234.261 190.0010 249.7515 1193.701 100
Please note that you'll have to specify n according to the length of the vector x or y.
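If you would rather not pass n by hand, a small wrapper (my addition, assuming myFunc as defined above) can derive it from the data:
myFunc_auto <- function(x, y) {
  stopifnot(NROW(x) == NROW(y))  # lengths must agree
  myFunc(x, y, n = NROW(y))      # n is taken from the data itself
}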

Related

Obtain Bootstrap Results in Matrix

I have written the following code.
library(quantreg)
# return the g function:
G = function(m, N, gamma) {
Tm = m * N
k = 1:Tm
Gvalue = sqrt(m) * (1 + k/m) * (k/(m + k))^gamma
return(Gvalue)
}
sqroot <- function(A) {
e = eigen(A)
v = e$vectors
val = e$values
sq = v %*% diag(sqrt(val)) %*% solve(v)
return(t(sq))
}
fa = function(m, N, a) {
Tm = m * N
k = 1:Tm
t = (m + k)/m
f_value = (t - 1) * t * (a^2 + log(t/(t - 1)))
return(sqrt(f_value))
}
m = 50
N = 2
n= 50*3
x1 = matrix(runif(n, 0, 1), ncol = 1)
x = cbind(1, x1)
beta = c(1, 1)
xb = x %*% beta
pr = 1/(1+exp(-xb))
y = rbinom(n,1,pr)
# calculate statistic:
stat = function(y, x, m, N, a) {
y_train = y[1:m]
x_train = x[(1:m),]
y_test = y[-(1:m)]
x_test = x[-(1:m),]
fit = glm(y ~ 0 + x, family="binomial")
coef = coef(fit)
log_predict = predict(fit, type="response")
sigma = sqrt(1/(m-1)* sum((y_train - log_predict)^2))
Jvalue = t(x_train) %*% x_train/m * sigma^2
Jsroot = sqroot(Jvalue)
fvalue = fa(m, N, a)
score1 = apply((x_test * as.vector((y_test - x_test %*% coef))), 2, cumsum)
statvalue1 = t(solve(Jsroot) %*% t(score1))/fvalue/sqrt(m)
statmax1 = pmax(abs(statvalue1[, 1]), abs(statvalue1[, 2]))
result = list(stat = statmax1)
return(result)
}
m =50
N = 2
a = 2.795
value = stat(y, x, m, N, a)
value
I want to perform a bootstrap to obtain B = 999 statistics. I use the following R code, but it produces an error saying "Error in statistic(data, original, ...) : argument "m" is missing, with no default".
library(boot)
data1 = data.frame(y = y, x = x1, m = m , N = N, a = a)
head(data1)
boot_value = boot(data1, statistic = stat, R = 999)
Can anyone give me a hint? Also, can I get the bootstrap results in matrix format, since the stat function gives 100 values?
There are different kinds of bootstrapping. If you want to draw 999 samples with replacement, each the same size as your data, you may just use replicate; no need for packages.
We put the data to be resampled into a data frame. It looks to me like m, N and a remain constant, so we just pass them along unchanged.
data2 <- data.frame(y=y, x=x)
The stat function needs to be adapted to unpack y and the x matrix. At the bottom we remove the list call to get just a vector back; unname() gives us the plain numbers.
stat2 <- function(data, m, N, a) {
y_train <- data[1:m, 1]
x_train <- as.matrix(data[1:m, 2:3])
y_test <- data[-(1:m), 1]
x_test <- as.matrix(data[-(1:m), 2:3])
y <- data[, "y"]
x <- as.matrix(data[, 2:3])
fit <- glm(y ~ 0 + x, family="binomial")
coef <- coef(fit)
log_predict <- predict(fit, type="response")
sigma <- sqrt(1/(m-1) * sum((y_train - log_predict)^2))
Jvalue <- t(x_train) %*% x_train/m * sigma^2
Jsroot <- sqroot(Jvalue)
fvalue <- fa(m, N, a)
score1 <- apply((x_test * as.vector((y_test - x_test %*% coef))), 2, cumsum)
statvalue1 <- t(solve(Jsroot) %*% t(score1))/fvalue/sqrt(m)
statmax1 <- pmax(abs(statvalue1[, 1]), abs(statvalue1[, 2]))
result <- unname(statmax1)
return(result)
}
replicate is a cousin of sapply, designed for repeated evaluation. In the call we just sample the rows 999 times and get a matrix back directly. As with sapply, we need to transpose the result.
res <- t(replicate(999, stat2(data2[sample(1:nrow(data2), nrow(data2), replace=TRUE), ], m, N, a)))
Result
As a result, we get 999 bootstrap replications in the rows, with 100 statistic values in the columns.
str(res)
# num [1:999, 1:100] 0.00205 0.38486 0.10146 0.12726 0.47056 ...
The code also runs quite fast.
user system elapsed
3.46 0.01 3.49
Note that there are different kinds of bootstrapping. For example, sometimes only part of the sample is resampled, weights are used, clustering is applied, and so on. Since you attempted to use boot, the method shown should be the default, though.
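If you would rather stay with the boot package, here is a hedged sketch (assuming stat2 and data2 as defined above): let the statistic accept the resampling indices as its second argument and pick up m, N and a from the enclosing environment, which also sidesteps a name clash with boot()'s own m argument:
library(boot)
# boot() calls statistic(data, indices); m, N and a come from the workspace
stat_boot <- function(data, idx) stat2(data[idx, ], m, N, a)
boot_value <- boot(data2, statistic = stat_boot, R = 999)
# boot_value$t is then a 999 x 100 matrix of bootstrap statistics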

Gradient descent function with output plot and regression line

I've been running the following code which returns the correct coefficients. However, no matter where I put a plot call, I can't get any plot output.
I'm not sure if a reproducible example is needed here, as I think this can be solved by looking at my gradientDescent function below? It's my first attempt at running this algorithm in R:
gradientDescent <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
m <- runif(1, 0, 1)
c <- runif(1, 0, 1)
yhat <- m * x + c
cost_error <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
converged = F
iterations = 0
while(converged == F) {
m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
m <- m_new
c <- c_new
yhat <- m * x + c
cost_error_new <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
if(cost_error - cost_error_new <= conv_threshold) {
converged = T
}
iterations = iterations + 1
if(iterations > max_iter) {
converged = T
return(paste("Optimal intercept:", c, "Optimal slope:", m))
}
}
}
It's unclear what you have been doing that was ineffective. The base graphics functions plot and abline should be able to produce output even when used inside functions. Lattice and ggplot2 graphics are based on grid graphics and therefore need a print() wrapped around the function calls to create output (as described in the R FAQ). So try this:
gradientDescent <- function(x, y, learn_rate, conv_threshold, n, max_iter)
{ ## plot.new() perhaps not needed
plot(x,y)
m <- runif(1, 0, 1)
c <- runif(1, 0, 1)
yhat <- m * x + c
cost_error <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
converged = F
iterations = 0
while(converged == F) {
m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
m <- m_new
c <- c_new
yhat <- m * x + c
cost_error_new <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
if(cost_error - cost_error_new <= conv_threshold) {
converged = T
}
iterations = iterations + 1
if(iterations > max_iter) { abline( c, m) #calculated
dev.off()
converged = T
return(paste("Optimal intercept:", c, "Optimal slope:", m))
}
}
}
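For completeness, the grid-graphics point above would look like this with ggplot2 (a hedged illustration; the data frame dat with columns x and y is hypothetical):
library(ggplot2)
plot_points <- function(dat) {
  p <- ggplot(dat, aes(x, y)) + geom_point()
  print(p)       # grid-based plots are not drawn automatically inside a function
  invisible(p)
}
# plot_points(data.frame(x = rnorm(10), y = rnorm(10)))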

R Gibbs Sampler for Bayesian Regression

I am trying to code a Gibbs sampler for a Bayesian regression model in R, and I am having trouble running my code. It seems there is something going on with the beta in the sigma.update function. When I run the code I get the error "Error in x %*% beta : non-conformable arguments". Here is what my code looks like:
x0 <- rep(1, 1000)
x1 <- rnorm(1000, 5, 7)
x <- cbind(x0, x1)
true_error <- rnorm(1000, 0, 2)
true_beta <- c(1.1, -8.2)
y <- x %*% true_beta + true_error
beta0 <- c(1, 1)
sigma0 <- 1
a <- b <- 1
burnin <- 0
thin <- 1
n <- 100
gibbs <- function(n.sims, beta.start, a, b,
y, x, burnin, thin) {
beta.draws <- matrix(NA, nrow=n.sims, ncol=1)
sigma.draws<- c()
beta.cur <- beta.start
sigma.update <- function(a,b, beta, y, x) {
1 / rgamma(1, a + ((length(x)) / 2),
b + (1 / 2) %*% (t(y - x %*% beta) %*% (y - x %*% beta)))
}
beta.update <- function(x, y, sigma) {
rnorm(1, (solve(t(x) %*% x) %*% t(x) %*% y),
sigma^2 * (solve(t(x) %*%x)))
}
for (i in 1:n.sims) {
sigma.cur <- sigma.update(a, b, beta.cur, y, x)
beta.cur <- beta.update(x, y, sigma.cur)
if (i > burnin & (i - burnin) %% thin == 0) {
sigma.draws[(i - burnin) / thin ] <- sigma.cur
beta.draws[(i - burnin) / thin,] <- beta.cur
}
}
return (list(sigma.draws, beta.draws) )
}
gibbs(n, beta0, a, b, y, x, burnin, thin)
The function beta.update is not correct; it returns NaN. You are passing a matrix as the sd argument of rnorm, which expects a vector. I think what you are trying to do could be done in this way:
beta.update <- function(x, y, sigma) {
rn <- rnorm(n=2, mean=0, sd=sigma)
xtxinv <- solve(crossprod(x))
as.vector(xtxinv %*% crossprod(x, y)) + xtxinv %*% rn
}
Notice that you are computing some elements that are fixed across all iterations. For example, you could compute t(x) %*% x once and pass it as an argument to the other functions. That way you avoid repeating these operations at every iteration, saving some computation and probably some time.
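A short sketch of that precomputation (my own illustration; it mirrors beta.update above with the fixed pieces hoisted out of the sampler loop):
xtx     <- crossprod(x)                            # t(x) %*% x, computed once
xtxinv  <- solve(xtx)
betahat <- as.vector(xtxinv %*% crossprod(x, y))   # OLS estimate, also fixed
beta.update.fast <- function(xtxinv, betahat, sigma) {
  betahat + xtxinv %*% rnorm(n = 2, mean = 0, sd = sigma)
}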
Edit
Based on your code, this is what I do:
x0 <- rep(1, 1000)
x1 <- rnorm(1000, 5, 7)
x <- cbind(x0, x1)
true_error <- rnorm(1000, 0, 2)
true_beta <- c(1.1, -8.2)
y <- x %*% true_beta + true_error
beta0 <- c(1, 1)
sigma0 <- 1
a <- b <- 1
burnin <- 0
thin <- 1
n <- 100
gibbs <- function(n.sims, beta.start, a, b, y, x, burnin, thin)
{
beta.draws <- matrix(NA, nrow=n.sims, ncol=2)
sigma.draws<- c()
beta.cur <- beta.start
sigma.update <- function(a,b, beta, y, x) {
1 / rgamma(1, a + ((length(x)) / 2),
b + (1 / 2) %*% (t(y - x %*% beta) %*% (y - x %*% beta)))
}
beta.update <- function(x, y, sigma) {
rn <- rnorm(n=2, mean=0, sd=sigma)
xtxinv <- solve(crossprod(x))
as.vector(xtxinv %*% crossprod(x, y)) + xtxinv %*% rn
}
for (i in 1:n.sims) {
sigma.cur <- sigma.update(a, b, beta.cur, y, x)
beta.cur <- beta.update(x, y, sigma.cur)
if (i > burnin & (i - burnin) %% thin == 0) {
sigma.draws[(i - burnin) / thin ] <- sigma.cur
beta.draws[(i - burnin) / thin,] <- beta.cur
}
}
return (list(sigma.draws, beta.draws) )
}
And this is what I get:
set.seed(123)
res <- gibbs(n, beta0, a, b, y, x, burnin, thin)
head(res[[1]])
# [1] 3015.256257 13.632748 1.950697 1.861225 1.928381 1.884090
tail(res[[1]])
# [1] 1.887497 1.915900 1.984031 2.010798 1.888575 1.994850
head(res[[2]])
# [,1] [,2]
# [1,] 7.135294 -8.697288
# [2,] 1.040720 -8.193057
# [3,] 1.047058 -8.193531
# [4,] 1.043769 -8.193183
# [5,] 1.043766 -8.193279
# [6,] 1.045247 -8.193356
tail(res[[2]])
# [,1] [,2]
# [95,] 1.048501 -8.193550
# [96,] 1.037859 -8.192848
# [97,] 1.045809 -8.193377
# [98,] 1.045611 -8.193374
# [99,] 1.038800 -8.192880
# [100,] 1.047063 -8.193479

Why is my logistic regression implementation so slow?

Here is an implementation of the batch gradient descent algorithm in R (theoretical details here):
logreg = function(y, x) {
x = as.matrix(x)
x = apply(x, 2, scale)
x = cbind(1, x)
m = nrow(x)
n = ncol(x)
alpha = 2/m
# b = matrix(rnorm(n))
# b = matrix(summary(lm(y~x))$coef[, 1])
b = matrix(rep(0, n))
h = 1 / (1 + exp(-x %*% b))
J = -(t(y) %*% log(h) + t(1-y) %*% log(1 -h))
derivJ = t(x) %*% (h-y)
niter = 0
while(1) {
niter = niter + 1
newb = b - alpha * derivJ
h = 1 / (1 + exp(-x %*% newb))
newJ = -(t(y) %*% log(h) + t(0-y) %*% log(1 -h))
while((newJ - J) >= 0) {
print("inner while...")
# step adjust
alpha = alpha / 1.15
newb = b - alpha * derivJ
h = 1 / (1 + exp(-x %*% newb))
newJ = -(t(y) %*% log(h) + t(1-y) %*% log(1 -h))
}
if(max(abs(b - newb)) < 0.001) {
break
}
b = newb
J = newJ
derivJ = t(x) %*% (h-y)
}
b
v = exp(-x %*% b)
h = 1 / (1 + v)
w = h^2 * v
# # hessian matrix of cost function
hess = t(x) %*% diag(as.vector(w)) %*% x
seMat = sqrt(diag(solve(hess)))
zscore = b / seMat
cbind(b, zscore)
}
nr = 5000
nc = 3
# set.seed(17)
x = matrix(rnorm(nr*nc, 0, 999), nr)
x = apply(x, 2, scale)
# y = matrix(sample(0:1, nr, repl=T), nr)
h = 1/(1 + exp(-x %*% rnorm(nc)))
y = round(h)
y[1:round(nr/2)] = sample(0:1, round(nr/2), repl=T)
testglm = function() {
for(i in 1:20) {
res = summary(glm(y~x, family=binomial))$coef
}
print(res)
}
testlogreg = function() {
for(i in 1:20) {
res = logreg(y, x)
}
print(res)
}
print(system.time(testlogreg()))
print(system.time(testglm()))
The algorithm gives me correct results, but it's ten times slower.
print(system.time(testlogreg()))
[,1] [,2]
[1,] -0.0358877 -1.16332
[2,] 0.1904964 6.09873
[3,] -0.1428953 -4.62629
[4,] -0.9151143 -25.33478
user system elapsed
4.013 1.037 5.062
#////////////////////////////////////////////////////
print(system.time(testglm()))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0360447 0.0308617 -1.16794 2.42829e-01
x1 0.1912254 0.0312500 6.11922 9.40373e-10
x2 -0.1432585 0.0309001 -4.63618 3.54907e-06
x3 -0.9178177 0.0361598 -25.38226 3.95964e-142
user system elapsed
0.482 0.040 0.522
But if I don't calculate the standard error and z value, then it's a little faster than glm:
#////////////////////////////////////////////////////
print(system.time(testlogreg()))
[,1]
[1,] -0.0396199
[2,] 0.2281502
[3,] -0.3941912
[4,] 0.8456839
user system elapsed
0.404 0.001 0.405
#////////////////////////////////////////////////////
print(system.time(testglm()))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0397529 0.0309169 -1.28580 1.98514e-01
x1 0.2289063 0.0312998 7.31336 2.60551e-13
x2 -0.3956140 0.0319847 -12.36884 3.85328e-35
x3 0.8483669 0.0353760 23.98144 4.34358e-127
user system elapsed
0.474 0.000 0.475
So apparently the calculation of se and z-value takes a lot of time, but how does glm do it? How can I improve my implementation?
Finally figured this out: the secret lies in the use of a sparse matrix (see also this blog post).
require(Matrix)
logreg = function(y, x) {
x = as.matrix(x)
x = apply(x, 2, scale)
x = cbind(1, x)
m = nrow(x)
n = ncol(x)
alpha = 2/m
# b = matrix(rnorm(n))
# b = matrix(summary(lm(y~x))$coef[, 1])
b = matrix(rep(0, n))
v = exp(-x %*% b)
h = 1 / (1 + v)
J = -(t(y) %*% log(h) + t(1-y) %*% log(1 -h))
derivJ = t(x) %*% (h-y)
derivThresh = 0.0000001
bThresh = 0.001
while(1) {
newb = b - alpha * derivJ
if(max(abs(b - newb)) < bThresh) {
break
}
v = exp(-x %*% newb)
h = 1 / (1 + v)
newderivJ = t(x) %*% (h-y)
if(max(abs(newderivJ - derivJ)) < derivThresh) {
break
}
newJ = -(t(y) %*% log(h) + t(0-y) %*% log(1 -h))
if(newJ > J) {
alpha = alpha/2
}
b = newb
J = newJ
derivJ = newderivJ
}
w = h^2 * v
# # hessian matrix of cost function
hess = t(x) %*% Diagonal(x = as.vector(w)) %*% x
seMat = sqrt(diag(solve(hess)))
zscore = b / seMat
cbind(b, zscore)
}
nr = 5000
nc = 2
# set.seed(17)
x = matrix(rnorm(nr*nc, 3, 9), nr)
# x = apply(x, 2, scale)
# y = matrix(sample(0:1, nr, repl=T), nr)
h = 1/(1 + exp(-x %*% rnorm(nc)))
y = round(h)
y[1:round(nr/2)] = sample(0:1, round(nr/2), repl=T)
ntests = 13
testglm = function() {
nr = 5000
nc = 2
# set.seed(17)
x = matrix(rnorm(nr*nc, 3, 9), nr)
# x = apply(x, 2, scale)
# y = matrix(sample(0:1, nr, repl=T), nr)
h = 1/(1 + exp(-x %*% rnorm(nc)))
y = round(h)
y[1:round(nr/2)] = sample(0:1, round(nr/2), repl=T)
for(i in 1:ntests) {
res = summary(glm(y~x, family=binomial))$coef[, c(1, 3)]
}
res
}
testlogreg = function() {
nr = 5000
nc = 2
# set.seed(17)
x = matrix(rnorm(nr*nc, 3, 9), nr)
# x = apply(x, 2, scale)
# y = matrix(sample(0:1, nr, repl=T), nr)
h = 1/(1 + exp(-x %*% rnorm(nc)))
y = round(h)
y[1:round(nr/2)] = sample(0:1, round(nr/2), repl=T)
for(i in 1:ntests) {
res = logreg(y, x)
}
res
}
print(system.time(testlogreg()))
print(system.time(testglm()))
Now my implementation is even a bit faster than the glm in R!
print(system.time(testlogreg()))
[,1] [,2]
[1,] -0.022598 -0.739494
[2,] -0.510799 -15.793676
[3,] -0.130177 -4.257121
[4,] 0.578318 17.712392
[5,] 0.299080 9.587985
[6,] 0.244131 7.888600
user system elapsed
8.954 0.044 9.000
#////////////////////////////////////////////////////
print(system.time(testglm()))
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0226784 0.0305694 -0.741865 4.58169e-01
x1 -0.5129285 0.0323621 -15.849653 1.41358e-56
x2 -0.1305872 0.0305892 -4.269057 1.96301e-05
x3 0.5806001 0.0326719 17.770648 1.19304e-70
x4 0.3002898 0.0312072 9.622454 6.42789e-22
x5 0.2451543 0.0309601 7.918407 2.40573e-15
user system elapsed
12.218 0.008 12.229
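As a hedged aside (not part of the answer above), the n-by-n diagonal matrix can also be avoided without the Matrix package: inside logreg(), where x already carries the intercept column and w = h^2 * v, scaling the rows of x by w before the cross-product gives the same Hessian.
# t(x) %*% diag(w) %*% x without materializing the n x n diagonal matrix
hess  <- crossprod(x, x * as.vector(w))
seMat <- sqrt(diag(solve(hess)))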

Writing a function for the Cramer Von Mises test

The cvm.test() from the dgof package provides a way of doing the one-sample Cramer-von Mises test on discrete distributions. My goal is to develop a function that does the test for continuous distributions as well (like the Kolmogorov-Smirnov ks.test() from the stats package).
Note: this post is concerned only with a fully specified null distribution, so please no bootstrapping or Monte Carlo simulation here.
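For the continuous, fully specified case, the W2 statistic itself reduces to a few lines via the probability integral transform. Here is a minimal sketch of just the statistic (my code; no p-value and no tie handling), to contrast with the discrete machinery in the sources below:
# one-sample Cramer-von Mises W^2 statistic for a continuous, fully specified null CDF
cvm_stat_cont <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)               # probability integral transform
  1 / (12 * n) + sum((u - (2 * seq_len(n) - 1) / (2 * n))^2)
}
# e.g. cvm_stat_cont(rnorm(100), pnorm)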
> cvm.test
function (x, y, type = c("W2", "U2", "A2"), simulate.p.value = FALSE,
B = 2000, tol = 1e-08)
{
cvm.pval.disc <- function(STAT, lambda) {
x <- STAT
theta <- function(u) {
VAL <- 0
for (i in 1:length(lambda)) {
VAL <- VAL + 0.5 * atan(lambda[i] * u)
}
return(VAL - 0.5 * x * u)
}
rho <- function(u) {
VAL <- 0
for (i in 1:length(lambda)) {
VAL <- VAL + log(1 + lambda[i]^2 * u^2)
}
VAL <- exp(VAL * 0.25)
return(VAL)
}
fun <- function(u) return(sin(theta(u))/(u * rho(u)))
pval <- 0
try(pval <- 0.5 + integrate(fun, 0, Inf, subdivisions = 1e+06)$value/pi,
silent = TRUE)
if (pval > 0.001)
return(pval)
if (pval <= 0.001) {
df <- sum(lambda != 0)
est1 <- dchisq(STAT/max(lambda), df)
logf <- function(t) {
ans <- -t * STAT
ans <- ans - 0.5 * sum(log(1 - 2 * t * lambda))
return(ans)
}
est2 <- 1
try(est2 <- exp(nlm(logf, 1/(4 * max(lambda)))$minimum),
silent = TRUE)
return(min(est1, est2))
}
}
cvm.stat.disc <- function(x, y, type = c("W2", "U2", "A2")) {
type <- match.arg(type)
I <- knots(y)
N <- length(x)
e <- diff(c(0, N * y(I)))
obs <- rep(0, length(I))
for (j in 1:length(I)) {
obs[j] <- length(which(x == I[j]))
}
S <- cumsum(obs)
T <- cumsum(e)
H <- T/N
p <- e/N
t <- (p + p[c(2:length(p), 1)])/2
Z <- S - T
Zbar <- sum(Z * t)
S0 <- diag(p) - p %*% t(p)
A <- matrix(1, length(p), length(p))
A <- apply(row(A) >= col(A), 2, as.numeric)
E <- diag(t)
One <- rep(1, nrow(E))
K <- diag(0, length(H))
diag(K)[-length(H)] <- 1/(H[-length(H)] * (1 - H[-length(H)]))
Sy <- A %*% S0 %*% t(A)
M <- switch(type, W2 = E, U2 = (diag(1, nrow(E)) - E %*%
One %*% t(One)) %*% E %*% (diag(1, nrow(E)) - One %*%
t(One) %*% E), A2 = E %*% K)
lambda <- eigen(M %*% Sy)$values
STAT <- switch(type, W2 = sum(Z^2 * t)/N, U2 = sum((Z -
Zbar)^2 * t)/N, A2 = sum((Z^2 * t/(H * (1 - H)))[-length(I)])/N)
return(c(STAT, lambda))
}
cvm.pval.disc.sim <- function(STATISTIC, lambda, y, type,
tol, B) {
knots.y <- knots(y)
fknots.y <- y(knots.y)
u <- runif(B * length(x))
u <- sapply(u, function(a) return(knots.y[sum(a > fknots.y) +
1]))
dim(u) <- c(B, length(x))
s <- apply(u, 1, cvm.stat.disc, y, type)
s <- s[1, ]
return(sum(s >= STATISTIC - tol)/B)
}
type <- match.arg(type)
DNAME <- deparse(substitute(x))
if (is.stepfun(y)) {
if (length(setdiff(x, knots(y))) != 0) {
stop("Data are incompatable with null distribution; ",
"Note: This function is meant only for discrete distributions ",
"you may be receiving this error because y is continuous.")
}
tempout <- cvm.stat.disc(x, y, type = type)
STAT <- tempout[1]
lambda <- tempout[2:length(tempout)]
if (!simulate.p.value) {
PVAL <- cvm.pval.disc(STAT, lambda)
}
else {
PVAL <- cvm.pval.disc.sim(STAT, lambda, y, type,
tol, B)
}
METHOD <- paste("Cramer-von Mises -", type)
names(STAT) <- as.character(type)
RVAL <- list(statistic = STAT, p.value = PVAL, alternative = "Two.sided",
method = METHOD, data.name = DNAME)
}
else {
stop("Null distribution must be a discrete.")
}
class(RVAL) <- "htest"
return(RVAL)
}
<environment: namespace:dgof>
Kolmogorov-Smirnov ks.test() from stats package for comparison (note that this function does both the one-sample and two-sample tests):
> ks.test
function (x, y, ..., alternative = c("two.sided", "less", "greater"),
exact = NULL, tol = 1e-08, simulate.p.value = FALSE, B = 2000)
{
pkolmogorov1x <- function(x, n) {
if (x <= 0)
return(0)
if (x >= 1)
return(1)
j <- seq.int(from = 0, to = floor(n * (1 - x)))
1 - x * sum(exp(lchoose(n, j) + (n - j) * log(1 - x -
j/n) + (j - 1) * log(x + j/n)))
}
exact.pval <- function(alternative, STATISTIC, x, n, y, knots.y,
tol) {
ts.pval <- function(S, x, n, y, knots.y, tol) {
f_n <- ecdf(x)
eps <- min(tol, min(diff(knots.y)) * tol)
eps2 <- min(tol, min(diff(y(knots.y))) * tol)
a <- rep(0, n)
b <- a
f_a <- a
for (i in 1:n) {
a[i] <- min(c(knots.y[which(y(knots.y) + S >=
i/n + eps2)[1]], Inf), na.rm = TRUE)
b[i] <- min(c(knots.y[which(y(knots.y) - S >
(i - 1)/n - eps2)[1]], Inf), na.rm = TRUE)
f_a[i] <- ifelse(!(a[i] %in% knots.y), y(a[i]),
y(a[i] - eps))
}
f_b <- y(b)
p <- rep(1, n + 1)
for (i in 1:n) {
tmp <- 0
for (k in 0:(i - 1)) {
tmp <- tmp + choose(i, k) * (-1)^(i - k - 1) *
max(f_b[k + 1] - f_a[i], 0)^(i - k) * p[k +
1]
}
p[i + 1] <- tmp
}
p <- max(0, 1 - p[n + 1])
if (p > 1) {
warning("numerical instability in p-value calculation.")
p <- 1
}
return(p)
}
less.pval <- function(S, n, H, z, tol) {
m <- ceiling(n * (1 - S))
c <- S + (1:m - 1)/n
CDFVAL <- H(sort(z))
for (j in 1:length(c)) {
ifelse((min(abs(c[j] - CDFVAL)) < tol), c[j] <- 1 -
c[j], c[j] <- 1 - CDFVAL[which(order(c(c[j],
CDFVAL)) == 1)])
}
b <- rep(0, m)
b[1] <- 1
for (k in 1:(m - 1)) b[k + 1] <- 1 - sum(choose(k,
1:k - 1) * c[1:k]^(k - 1:k + 1) * b[1:k])
p <- sum(choose(n, 0:(m - 1)) * c^(n - 0:(m - 1)) *
b)
return(p)
}
greater.pval <- function(S, n, H, z, tol) {
m <- ceiling(n * (1 - S))
c <- 1 - (S + (1:m - 1)/n)
CDFVAL <- c(0, H(sort(z)))
for (j in 1:length(c)) {
if (!(min(abs(c[j] - CDFVAL)) < tol))
c[j] <- CDFVAL[which(order(c(c[j], CDFVAL)) ==
1) - 1]
}
b <- rep(0, m)
b[1] <- 1
for (k in 1:(m - 1)) b[k + 1] <- 1 - sum(choose(k,
1:k - 1) * c[1:k]^(k - 1:k + 1) * b[1:k])
p <- sum(choose(n, 0:(m - 1)) * c^(n - 0:(m - 1)) *
b)
return(p)
}
p <- switch(alternative, two.sided = ts.pval(STATISTIC,
x, n, y, knots.y, tol), less = less.pval(STATISTIC,
n, y, knots.y, tol), greater = greater.pval(STATISTIC,
n, y, knots.y, tol))
return(p)
}
sim.pval <- function(alternative, STATISTIC, x, n, y, knots.y,
tol, B) {
fknots.y <- y(knots.y)
u <- runif(B * length(x))
u <- sapply(u, function(a) return(knots.y[sum(a > fknots.y) +
1]))
dim(u) <- c(B, length(x))
getks <- function(a, knots.y, fknots.y) {
dev <- c(0, ecdf(a)(knots.y) - fknots.y)
STATISTIC <- switch(alternative, two.sided = max(abs(dev)),
greater = max(dev), less = max(-dev))
return(STATISTIC)
}
s <- apply(u, 1, getks, knots.y, fknots.y)
return(sum(s >= STATISTIC - tol)/B)
}
alternative <- match.arg(alternative)
DNAME <- deparse(substitute(x))
x <- x[!is.na(x)]
n <- length(x)
if (n < 1L)
stop("not enough 'x' data")
PVAL <- NULL
if (is.numeric(y)) {
DNAME <- paste(DNAME, "and", deparse(substitute(y)))
y <- y[!is.na(y)]
n.x <- as.double(n)
n.y <- length(y)
if (n.y < 1L)
stop("not enough 'y' data")
if (is.null(exact))
exact <- (n.x * n.y < 10000)
METHOD <- "Two-sample Kolmogorov-Smirnov test"
TIES <- FALSE
n <- n.x * n.y/(n.x + n.y)
w <- c(x, y)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
if (length(unique(w)) < (n.x + n.y)) {
warning("cannot compute correct p-values with ties")
z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]
TIES <- TRUE
}
STATISTIC <- switch(alternative, two.sided = max(abs(z)),
greater = max(z), less = -min(z))
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below that of y", greater = "the CDF of x lies above that of y")
if (exact && (alternative == "two.sided") && !TIES)
PVAL <- 1 - .C("psmirnov2x", p = as.double(STATISTIC),
as.integer(n.x), as.integer(n.y), PACKAGE = "dgof")$p
}
else if (is.stepfun(y)) {
z <- knots(y)
if (is.null(exact))
exact <- (n <= 30)
if (exact && n > 30) {
warning("numerical instability may affect p-value")
}
METHOD <- "One-sample Kolmogorov-Smirnov test"
dev <- c(0, ecdf(x)(z) - y(z))
STATISTIC <- switch(alternative, two.sided = max(abs(dev)),
greater = max(dev), less = max(-dev))
if (simulate.p.value) {
PVAL <- sim.pval(alternative, STATISTIC, x, n, y,
z, tol, B)
}
else {
PVAL <- switch(exact, `TRUE` = exact.pval(alternative,
STATISTIC, x, n, y, z, tol), `FALSE` = NULL)
}
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below the null hypothesis",
greater = "the CDF of x lies above the null hypothesis")
}
else {
if (is.character(y))
y <- get(y, mode = "function")
if (mode(y) != "function")
stop("'y' must be numeric or a string naming a valid function")
if (is.null(exact))
exact <- (n < 100)
METHOD <- "One-sample Kolmogorov-Smirnov test"
TIES <- FALSE
if (length(unique(x)) < n) {
warning(paste("default ks.test() cannot compute correct p-values with ties;\n",
"see help page for one-sample Kolmogorov test for discrete distributions."))
TIES <- TRUE
}
x <- y(sort(x), ...) - (0:(n - 1))/n
STATISTIC <- switch(alternative, two.sided = max(c(x,
1/n - x)), greater = max(1/n - x), less = max(x))
if (exact && !TIES) {
PVAL <- if (alternative == "two.sided")
1 - .C("pkolmogorov2x", p = as.double(STATISTIC),
as.integer(n), PACKAGE = "dgof")$p
else 1 - pkolmogorov1x(STATISTIC, n)
}
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below the null hypothesis",
greater = "the CDF of x lies above the null hypothesis")
}
names(STATISTIC) <- switch(alternative, two.sided = "D",
greater = "D^+", less = "D^-")
pkstwo <- function(x, tol = 1e-06) {
if (is.numeric(x))
x <- as.vector(x)
else stop("argument 'x' must be numeric")
p <- rep(0, length(x))
p[is.na(x)] <- NA
IND <- which(!is.na(x) & (x > 0))
if (length(IND)) {
p[IND] <- .C("pkstwo", as.integer(length(x[IND])),
p = as.double(x[IND]), as.double(tol), PACKAGE = "dgof")$p
}
return(p)
}
if (is.null(PVAL)) {
PVAL <- ifelse(alternative == "two.sided", 1 - pkstwo(sqrt(n) *
STATISTIC), exp(-2 * n * STATISTIC^2))
}
RVAL <- list(statistic = STATISTIC, p.value = PVAL, alternative = nm_alternative,
method = METHOD, data.name = DNAME)
class(RVAL) <- "htest"
return(RVAL)
}
<environment: namespace:dgof>
