Gradient descent function with output plot and regression line in R

I've been running the following code, which returns the correct coefficients. However, no matter where I put a plot call, I can't get any plot output.
I'm not sure a reproducible example is needed here, as I think this can be solved by looking at my gradientDescent function below. It's my first attempt at implementing this algorithm in R:
gradientDescent <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
  m <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  yhat <- m * x + c
  cost_error <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
  converged = F
  iterations = 0
  while(converged == F) {
    m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
    c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
    m <- m_new
    c <- c_new
    yhat <- m * x + c
    cost_error_new <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
    if(cost_error - cost_error_new <= conv_threshold) {
      converged = T
    }
    iterations = iterations + 1
    if(iterations > max_iter) {
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
  }
}
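In case it helps, a call of the kind I have in mind looks roughly like this (the data below are made up purely for illustration):
set.seed(1)
x <- rnorm(100)
y <- 2 * x + 1 + rnorm(100, sd = 0.5)   # illustrative data, not from my real application
gradientDescent(x, y, learn_rate = 0.01, conv_threshold = 1e-6,
                n = length(x), max_iter = 10000)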

It's unclear what you tried that was ineffective. The base graphics functions plot and abline are able to produce output even when used inside functions. Lattice and ggplot2 graphics are based on grid graphics and therefore need a print() wrapped around the calls to create output from inside a function (as described in the R FAQ). So try this:
gradientDescent <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
  ## plot.new() perhaps not needed
  plot(x, y)
  m <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  yhat <- m * x + c
  cost_error <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
  converged = F
  iterations = 0
  while(converged == F) {
    m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
    c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
    m <- m_new
    c <- c_new
    yhat <- m * x + c
    cost_error_new <- (1 / (n + 2)) * sum((y - yhat) ^ 2)
    if(cost_error - cost_error_new <= conv_threshold) {
      converged = T
    }
    iterations = iterations + 1
    if(iterations > max_iter) {
      abline(c, m)   #calculated
      dev.off()
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
  }
}
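As an aside, if you ever switch to ggplot2 or lattice for this plot (neither is used above), the grid-based equivalent needs an explicit print() when called inside a function. A minimal sketch, assuming ggplot2 is installed:
library(ggplot2)
plot_fit <- function(x, y, intercept, slope) {
  p <- ggplot(data.frame(x = x, y = y), aes(x, y)) +
    geom_point() +
    geom_abline(intercept = intercept, slope = slope)
  print(p)  # inside a function, grid graphics are only drawn when printed explicitly
}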


Non-numeric matrix extent in R

mR <- readRDS("mData_retake-kopi.r")
LL <- function(theta, mR){
  if (alpha + beta != 1 && 0 > alpha && alpha < 1 && v < 2) {
    cat("error: Constraint doesn't hold")
    return(NULL)
  }
  alpha <- theta[1]
  beta <- theta[2]
  mu <- theta[3]
  v <- theta[4]
  N <- dim(mR)[1]
  sig <- matrix(0, N, 2)
  sig[1] <- var(mR)
  ll <- matrix(0, N - 1)
  for (n in seq(2, N)){
    sig[n, 1] <- alpha * mR[n-1] + beta * sig[n-1]
    ll[n-1] <- (gamma( (v+1) / 2) / sqrt(v * pi) * gamma( v / 2 )) * ( 1 + (e^2 / v))^(-(v+1)/2)
  }
  return( -mean(ll) )
}
theta <- c(0.90, 0.10, 0, 5)
LL(theta,mR)
It returns
Error in matrix(0, nrow = unlist(N), ncol = 2) : non-numeric matrix
extent
I'm having an issue with mR.
I believe I have defined it by reading in my data, but the "sig" part keeps throwing the non-numeric matrix extent error.
My goal is to compute the negative average log-likelihood.
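One check worth doing (just a guess on my part, since this usually causes that error): dim() returns NULL for plain lists and vectors, which would make N non-numeric when it reaches matrix().
str(mR)              # inspect what readRDS() actually returned
is.matrix(mR)        # FALSE would explain why dim(mR)[1] is not a usable extent
mR <- as.matrix(mR)  # hypothetical fix if mR came back as a data frame of returns
N  <- NROW(mR)       # NROW() also works for plain vectors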

Fast and accurate computation of studentized external residuals in R

I want to compute the external studentized residuals of a dataset {x,y} of size n in R given the following constraints:
(very) high precision
high performance (avoiding loops where possible)
R language (including RCPP)
The R code should be fast because it will be used extensively (10^9 times minimum) on multiple data sets with n in [10^3, 10^6]. This question is part of a larger project for estimating a custom statistic that requires the studentized residuals. The part presented here is the most computationally expensive one, so solving it would dramatically improve the overall efficiency.
On the lm() regression
To obtain the externally studentized residuals, one typically runs lm() and then rstudent(). The R function uses an approach that avoids running n regressions to estimate the studentized residuals, which saves a lot of execution time. However, I prefer not to use lm() because I only need the residuals, without all the additional output that comes with it (thus saving some more execution time).
When trying to decipher the R source code for the external residuals of lm(), I found it somewhat obscure, as it calls code from other files (for example, the influence() function). So far I have failed to collect enough information to replicate that section from the source code alone.
Relevant topic(s) on Stack
The following relevant topic has been found on Stack Overflow: How to compute Studentized Residuals in Python?
An R implementation of the Python procedure, including a minimal example, is given below (corrected by @Stéphane Laurent, see the answers):
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
m = 2
mean_y = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_y) %*% (y - mean_y)
beta_1 = ((y - mean_y) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_y
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_y) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
reg = rstudent(lm(x ~ y))
res = cbind(reg, studentized_residuals)
It produces the following differences:
index  reg                 studentized_residuals
1      -0.595911898846465  -0.581348373714385
2       0.116208945967327   0.116097011762269
3      -2.04779452591111   -1.61939642040734
4       2.26350621688535    1.71995630000724
5       0.603322309518977   0.588222428131761
6      -1.5460639774285    -1.33486217871738
7       0.367900050364855   0.364393996552621
8       1.14745971090533    1.05271762293388
9       0.823888320713653   0.786630743176311
10     -0.449839343257121  -0.443475039943641
Minimal example
The following R attempt has been tested using arbitrary data sets, just for illustration purposes.
It uses lm() / rstudent() and is far too slow for our practical application. The two parameters n1 and n2 correspond to the number of iterations and the size of the vector (denoted n above), respectively. To match our problem, we would typically pick n1 in [10^6, 10^9] and n2 in [10^3, 10^6]:
Stud = function(n1, n2){
  res = data.frame(matrix(vector(), n2, n1))
  for(i in 1 : n1){
    x = rnorm(n2)
    y = rnorm(n2)
    reg = lm(x ~ y)
    res[, i] = rstudent(reg)
  }
}
Update and additional (full) minimal example for the benchmark:
Here we show a complete benchmark where various functions from Stack Overflow are tested against lm() with the goal of obtaining the externally studentized residuals. To obtain these residuals we need to run 'n' regressions. Results are given after the code for 100 and 500 replications.
#Packages
install.packages("Rcpp")
library(Rcpp)
install.packages("RcppArmadillo")
library(RcppArmadillo)
install.packages("RcppEigen")
library(RcppEigen)
install.packages("stats")
library(stats)
install.packages("speedglm")
library(speedglm)
install.packages("Rfast")
library(Rfast)
install.packages("rbenchmark")
library(rbenchmark)
## start from SEXP, most conversions, longest code
src <- '
Rcpp::List fLmSEXP(SEXP Xs, SEXP ys) {
Rcpp::NumericMatrix Xr(Xs);
Rcpp::NumericVector yr(ys);
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Rcpp types as early RcppArmadillo examples did
src <- '
Rcpp::List fLmTwoCasts(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types
src <- '
Rcpp::List fLmOneCast(arma::mat X, arma::colvec y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types passed as constant references
src <- '
Rcpp::List fLmConstRef(const arma::mat & X, const arma::colvec & y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
#Benchmark
data = benchmark("OneCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmOneCast(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"TwoCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmTwoCasts(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Const" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmConstRef(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Sexp" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmSEXP(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fastLm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Speed" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- speedlm(x[-i] ~ y[-i], fitted = T)
sum((x[-i] - fit$fitted.values)^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
".Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lmfit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Lm" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Basic" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
reg <- lm(x ~ y)
reg_stud <- rstudent(reg)
},
replications = 500,
columns = c("test", "elapsed", "replications"))
Results:
On this benchmark, rstudent(lm()) (the "Basic" case) is much faster than everything else:
test elapsed replications
7 .Fit 13.84 100
10 Basic 0.25 100
3 Const 7.37 100
5 Fast 99.84 100
8 Fit 7.06 100
9 Lm 105.25 100
1 OneCast 7.61 100
4 Sexp 7.66 100
6 Speed 184.76 100
2 TwoCast 7.17 100
7 .Fit 63.63 500
10 Basic 0.93 500
3 Const 34.44 500
5 Fast 438.95 500
8 Fit 31.11 500
9 Lm 471.37 500
1 OneCast 34.29 500
4 Sexp 33.48 500
6 Speed 794.73 500
2 TwoCast 33.51 500
Interpretation
It seems that R uses an analytical alternative that avoids running 'n' regressions, resulting in a much faster computation.
Thus, the question remains: how can we be competitive with rstudent(lm()), and how can we reverse-engineer the original source code (which is difficult to follow)?
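For what it's worth, the textbook identity for externally studentized residuals avoids the n leave-one-out regressions entirely: with residuals e_i, leverages h_ii, and p estimated coefficients, s_(i)^2 = ((n - p) * s^2 - e_i^2 / (1 - h_ii)) / (n - p - 1) and t_i = e_i / sqrt(s_(i)^2 * (1 - h_ii)). A sketch for the simple-regression case used above (my own illustration, not the code from R's sources):
rstudent_analytic <- function(x, y) {
  # x is the response and y the single predictor, matching the x ~ y orientation above
  n    <- length(x)
  fit  <- lm.fit(cbind(1, y), x)
  e    <- fit$residuals
  p    <- fit$rank                                         # 2: intercept + slope
  h    <- 1 / n + (y - mean(y))^2 / sum((y - mean(y))^2)   # leverages for one predictor
  s2   <- sum(e^2) / (n - p)
  s2_i <- ((n - p) * s2 - e^2 / (1 - h)) / (n - p - 1)     # leave-one-out variances
  e / sqrt(s2_i * (1 - h))
}
# should agree with rstudent(lm(x ~ y)) up to numerical precision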
Final results
We compared the solutions of @Onyambu, @tester, and @Stéphane Laurent. We found @Onyambu's solution to be the fastest across different vector sizes, while producing results exactly equal to those of rstudent().
One gets the same results by replacing your var_e with
var_e = vapply(1:n, function(i){
sigma(lm(x[-i] ~ y[-i]))
}, numeric(1))
To get that efficiently, do not use lm but lm.fit:
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
EDIT:
A faster rstudent function than the one previously given was found:
fast_rstudent <- function(X, y, intercept = TRUE, tol = 1e-7){
  # tol: QR tolerance; added as an argument here because it was not defined in the snippet
  mqr <- .Call(stats:::C_Cdqrls, cbind(intercept, X), y, tol, FALSE)
  res <- .Call(stats:::C_influence, mqr, mqr$residuals, 1e-12)
  mqr$residuals / (res$sigma * sqrt(1 - res$hat))
}
So far this function is very fast.
Previous Answer
Since you are using R, you could use a QR decomposition to solve this. Your aim is to write an rstudent function that is faster than the built-in one by getting rid of the overhead of the surrounding function calls. That means you should only use the necessary internal functions. Below is a quick way to do this:
my_rstudent <- function (X, y, intercept = TRUE) {
  X <- cbind(intercept, X)
  u <- .Call(stats:::C_Cdqrls, X, y, 1e-7, FALSE)
  d <- dim(X)
  n <- as.integer(d[1L])
  k <- as.integer(d[2L])
  df_res <- n - k
  z <- .Internal(diag(1, n, k))
  v <- .Fortran(.F_dqrqy, as.double(u$qr), n, k, as.double(u$qraux),
                z, k, qy = z)$qy
  h_ii <- .Internal(rowSums(v^2, n, k, FALSE))
  rstand <- u$residuals / sqrt(sum(u$residuals**2) / df_res) / sqrt(1 - h_ii)
  rstand * sqrt((df_res - 1) / (df_res - rstand^2))
}
In a way this function misuses R by almost removing the overhead functions entirely. This assumes that what is being given to the function is correct.
Results:
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
cbind(mine=my_rstudent(x, y), from_R=rstudent(lm(y~x)))
mine from_R
1 0.92113157 0.92113157
2 0.15753536 0.15753536
3 -1.69587949 -1.69587949
4 -3.59182456 -3.59182456
5 0.98274664 0.98274664
6 -0.85765961 -0.85765961
7 -0.07768369 -0.07768369
8 1.05874766 1.05874766
9 0.80181623 0.80181623
10 0.11418833 0.11418833
benchmark:
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
rstudent(lm(y ~ x)) 45.36667 37.20755 26.89753 24.29545 22.39587 11.31733 100
With a small dataset, the overhead functions quite noticeably slow down the computation of rstudent.
Relatively large dataset:
n = 1000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
rstudent(lm(y ~ x)) 8.530228 8.059269 7.700426 7.848123 7.616909 3.877305 100
Huge dataset:
n = 1000000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10
rstudent(lm(y ~ x)) 1.510198 1.560989 1.486083 1.666609 1.603455 1.01154 10
Very large dataset:
n = 10000000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)),unit="relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x, y) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10
rstudent(lm(y ~ x)) 1.603652 1.603881 1.534455 1.58802 1.560724 1.305315 10
microbenchmark::microbenchmark(my_rstudent(x, y),rstudent(lm(y~x)), times = 10)
Unit: seconds
expr min lq mean median uq max neval
my_rstudent(x, y) 1.584408 1.619822 1.727310 1.658917 1.757311 2.213203 10
rstudent(lm(y ~ x)) 2.458445 2.619609 2.705212 2.696705 2.776588 2.949799 10
I think the solution to your problem will be dropping all unnecessary overhead from the functions first; if that is not fast enough, try converting the code to C++ and running it with Rcpp. It is very likely that you'll be able to improve on my results if you compute the residuals from .lm.fit with your own implementation, instead of using lm.fit as I did.
I also checked whether there is a difference in the studentized residuals depending on which function you use (lm, lm.fit, .lm.fit), and it turns out that this is the case. However, the residuals from my function here are equal to those produced by MASS::studres for a regression of y ~ x with x having only one column.
Here's my code and a benchmark versus the fastest version from above called "Basic":
library(rbenchmark)
library(microbenchmark)
library(MASS)
set.seed(1)
x <- matrix(rnorm(500), ncol = 1)
y <- matrix(rnorm(500), ncol = 1)
myFunc <- function(x, y, n = 500){
  # tmp <- .lm.fit(x, y) # linear model fit
  object <- lm.fit(x = x, y = y)
  resid <- object$residuals
  hat <- lm.influence(object, do.coef = FALSE)$hat
  # hat <- hat[hat > 0]     # remove checks
  # ok <- !(is.na(resid))   # remove checks
  # n.miss <- sum(!ok)      # remove checks
  # resid <- resid[ok]      # remove checks
  # n <- length(resid)
  # p <- object$rank        # equal to one
  p <- 1
  rdf <- n - 1
  studres <- resid
  stddev <- sqrt(sum(resid^2) / rdf)
  sr <- resid / (sqrt(1 - hat) * stddev)
  stdres <- sr
  studres <- sr / sqrt((n - p - sr^2) / (n - p - 1))
  studres <- naresid(object$na.action, studres)
  return(studres)
}
test1 <- stats::rstudent(lm(x ~ y)) # rstudent doesn't work with lm.fit
test2 <- MASS::studres(lm(x ~ y))
test3 <- MASS::studres(lm.fit(x, y))
test4 <- myFunc(x, y, n = 500)
> head(cbind(test1, test2, test3, test4))
test1 test2 test3 test4
1 -0.6368094 -0.6368094 0.04696790 0.04696790
2 0.1493050 0.1493050 -0.27286396 -0.27286396
3 -0.8941217 -0.8941217 -1.15505676 -1.15505676
4 1.5598965 1.5598965 0.07729179 0.07729179
5 0.3440252 0.3440252 0.95155123 0.95155123
6 -0.7714317 -0.7714317 1.47600416 1.47600416
####################################
mbm <- microbenchmark("lm" = {rstudent(lm(y~x)) },
"MASS_lm" = {
MASS::studres(lm(y~x))
},
"MASS_lm.fit" = {
MASS::studres(lm.fit(x = x , y = y))
},
"myFunc" = {myFunc(x, y, n = 500)},
times = 100
)
> mbm
Unit: microseconds
expr min lq mean median uq max neval
lm 767.001 869.1510 1188.023 977.1505 1185.5010 8279.801 100
MASS_lm 704.601 909.2000 1085.261 997.3515 1168.8505 2052.202 100
MASS_lm.fit 168.001 195.0510 282.166 212.9510 254.1015 2912.201 100
myFunc 147.901 168.8015 234.261 190.0010 249.7515 1193.701 100
Please note that you'll have to specify n according to the length of the vectors x and y.

foreach unable to find object within function in R

I have a function defined below named algor, which is translated from MATLAB to R. In order to make the function faster, I am using the foreach construct for the first time. I have the complete function code below:
library("ramify")
library("foreach")
algor <- function (vc) {
  # initialize A, ybar, and Ia
  A <- 0
  ybar <- 0
  Ia <- 0
  # x is the first column of vc
  x <- vc[, 1, drop = FALSE]
  # y is the second column of vc
  y <- vc[, 2, drop = FALSE]
  # n is the length of x
  n <- length(x)
  foreach(i = 1:(n-1), .combine = 'c', .export = c("A", "ybar", "Ia", "x", "y")) %do% {
    A <- A + 0.5 * (x[i] - x[i+1]) * (y[i] + y[i+1])
    ybar <- ybar + (1 / 6) * (x[i] - x[i+1]) * (y[i] ^ 2 + y[i] * y[i+1] + y[i+1] ^ 2)
    Ia <- Ia + (1 / 12) * (x[i] - x[i+1]) * (y[i] ^ 3 + y[i] ^ 2 * y[i+1] + y[i] * y[i+1] ^ 2 + y[i+1] ^ 3)
  }
  props <- mat("A, Ia, ybar", eval = TRUE)
  return(props)
}
inner <- mat("0, 300; 300, 300; 300, 695; 0, 695; 0, 300")
algor(inner)
Although I have exported A, ybar, Ia, x, and y, I am getting an error that the object A is not found, shown below:
Error in eval(parse(text = paste0("c(", paste0(char_vals, collapse = ","), :
object 'A' not found
Called from: eval(parse(text = paste0("c(", paste0(char_vals, collapse = ","),
")")))
How do I get foreach to recognize the defined objects: A, ybar, Ia, x, and y?
Thank you.
Try defining the variables in .GlobalEnv inside the foreach loop on every iteration.
library("ramify")
library("foreach")
algor <- function (vc) {
  # initialize A, ybar, and Ia
  A <- 0
  ybar <- 0
  Ia <- 0
  # x is the first column of vc
  x <- vc[, 1, drop = FALSE]
  # y is the second column of vc
  y <- vc[, 2, drop = FALSE]
  # n is the length of x
  n <- length(x)
  foreach(i = 1:(n-1), .combine = 'c', .export = c("A", "ybar", "Ia", "x", "y")) %do% {
    .GlobalEnv$A <- A
    .GlobalEnv$ybar <- ybar
    .GlobalEnv$Ia <- Ia
    A <- A + 0.5 * (x[i] - x[i+1]) * (y[i] + y[i+1])
    ybar <- ybar + (1 / 6) * (x[i] - x[i+1]) * (y[i] ^ 2 + y[i] * y[i+1] + y[i+1] ^ 2)
    Ia <- Ia + (1 / 12) * (x[i] - x[i+1]) * (y[i] ^ 3 + y[i] ^ 2 * y[i+1] + y[i] * y[i+1] ^ 2 + y[i+1] ^ 3)
  }
  props <- mat("A, Ia, ybar", eval = TRUE)
  return(props)
}
inner <- mat("0, 300; 300, 300; 300, 695; 0, 695; 0, 300")
algor(inner)
This returns:
[,1] [,2] [,3]
[1,] 118500 30870237500 58953750
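A side note (my own sketch, not part of the answer above): because each term only involves consecutive rows, the whole computation can also be written with plain vectorized arithmetic, which sidesteps both the foreach scoping issue and the loop itself.
algor_vec <- function(vc) {
  x <- vc[, 1]
  y <- vc[, 2]
  i <- seq_len(nrow(vc) - 1)
  dx <- x[i] - x[i + 1]
  A    <- sum(0.5 * dx * (y[i] + y[i + 1]))
  ybar <- sum((1 / 6) * dx * (y[i]^2 + y[i] * y[i + 1] + y[i + 1]^2))
  Ia   <- sum((1 / 12) * dx * (y[i]^3 + y[i]^2 * y[i + 1] + y[i] * y[i + 1]^2 + y[i + 1]^3))
  c(A = A, Ia = Ia, ybar = ybar)
}
algor_vec(inner)   # should reproduce 118500, 30870237500, 58953750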

R Gibbs Sampler for Bayesian Regression

I am trying to code a Gibbs sampler for a Bayesian regression model in R, and I am having trouble running my code. It seems there is something going on with the beta in the sigma.update function. When I run the code I get an error that says "Error in x %*% beta : non-conformable arguments". Here is what my code looks like:
x0 <- rep(1, 1000)
x1 <- rnorm(1000, 5, 7)
x <- cbind(x0, x1)
true_error <- rnorm(1000, 0, 2)
true_beta <- c(1.1, -8.2)
y <- x %*% true_beta + true_error
beta0 <- c(1, 1)
sigma0 <- 1
a <- b <- 1
burnin <- 0
thin <- 1
n <- 100
gibbs <- function(n.sims, beta.start, a, b, y, x, burnin, thin) {
  beta.draws <- matrix(NA, nrow=n.sims, ncol=1)
  sigma.draws <- c()
  beta.cur <- beta.start
  sigma.update <- function(a, b, beta, y, x) {
    1 / rgamma(1, a + ((length(x)) / 2),
               b + (1 / 2) %*% (t(y - x %*% beta) %*% (y - x %*% beta)))
  }
  beta.update <- function(x, y, sigma) {
    rnorm(1, (solve(t(x) %*% x) %*% t(x) %*% y),
          sigma^2 * (solve(t(x) %*% x)))
  }
  for (i in 1:n.sims) {
    sigma.cur <- sigma.update(a, b, beta.cur, y, x)
    beta.cur <- beta.update(x, y, sigma.cur)
    if (i > burnin & (i - burnin) %% thin == 0) {
      sigma.draws[(i - burnin) / thin] <- sigma.cur
      beta.draws[(i - burnin) / thin, ] <- beta.cur
    }
  }
  return(list(sigma.draws, beta.draws))
}
gibbs(n, beta0, a, b, y, x, burnin, thin)
The function beta.update is not correct; it returns NaN. You are passing a matrix as the sd argument of rnorm, which expects a vector. I think what you are trying to do could be done this way:
beta.update <- function(x, y, sigma) {
rn <- rnorm(n=2, mean=0, sd=sigma)
xtxinv <- solve(crossprod(x))
as.vector(xtxinv %*% crossprod(x, y)) + xtxinv %*% rn
}
Notice that you are computing some quantities that are fixed across all iterations. For example, you could compute t(x) %*% x once and pass it as an argument to the other functions; that way you avoid redoing these operations at every iteration, saving some computation and probably some time. A sketch of what I mean follows.
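# Hypothetical sketch (my illustration): precompute the fixed pieces once and
# pass them in, so the Gibbs loop does not redo the same matrix algebra.
xtxinv   <- solve(crossprod(x))                     # (X'X)^{-1}, constant across iterations
beta_ols <- as.vector(xtxinv %*% crossprod(x, y))   # (X'X)^{-1} X'y, also constant
beta.update <- function(beta_ols, xtxinv, sigma) {
  beta_ols + xtxinv %*% rnorm(n = 2, mean = 0, sd = sigma)
}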
Edit
Based on your code, this is what I do:
x0 <- rep(1, 1000)
x1 <- rnorm(1000, 5, 7)
x <- cbind(x0, x1)
true_error <- rnorm(1000, 0, 2)
true_beta <- c(1.1, -8.2)
y <- x %*% true_beta + true_error
beta0 <- c(1, 1)
sigma0 <- 1
a <- b <- 1
burnin <- 0
thin <- 1
n <- 100
gibbs <- function(n.sims, beta.start, a, b, y, x, burnin, thin)
{
  beta.draws <- matrix(NA, nrow=n.sims, ncol=2)
  sigma.draws <- c()
  beta.cur <- beta.start
  sigma.update <- function(a, b, beta, y, x) {
    1 / rgamma(1, a + ((length(x)) / 2),
               b + (1 / 2) %*% (t(y - x %*% beta) %*% (y - x %*% beta)))
  }
  beta.update <- function(x, y, sigma) {
    rn <- rnorm(n=2, mean=0, sd=sigma)
    xtxinv <- solve(crossprod(x))
    as.vector(xtxinv %*% crossprod(x, y)) + xtxinv %*% rn
  }
  for (i in 1:n.sims) {
    sigma.cur <- sigma.update(a, b, beta.cur, y, x)
    beta.cur <- beta.update(x, y, sigma.cur)
    if (i > burnin & (i - burnin) %% thin == 0) {
      sigma.draws[(i - burnin) / thin] <- sigma.cur
      beta.draws[(i - burnin) / thin, ] <- beta.cur
    }
  }
  return(list(sigma.draws, beta.draws))
}
And this is what I get:
set.seed(123)
res <- gibbs(n, beta0, a, b, y, x, burnin, thin)
head(res[[1]])
# [1] 3015.256257 13.632748 1.950697 1.861225 1.928381 1.884090
tail(res[[1]])
# [1] 1.887497 1.915900 1.984031 2.010798 1.888575 1.994850
head(res[[2]])
# [,1] [,2]
# [1,] 7.135294 -8.697288
# [2,] 1.040720 -8.193057
# [3,] 1.047058 -8.193531
# [4,] 1.043769 -8.193183
# [5,] 1.043766 -8.193279
# [6,] 1.045247 -8.193356
tail(res[[2]])
# [,1] [,2]
# [95,] 1.048501 -8.193550
# [96,] 1.037859 -8.192848
# [97,] 1.045809 -8.193377
# [98,] 1.045611 -8.193374
# [99,] 1.038800 -8.192880
# [100,] 1.047063 -8.193479

Writing a function for the Cramer Von Mises test

The cvm.test() from the dgof package provides a way of doing the one-sample Cramer-von Mises test on discrete distributions; my goal is to develop a function that does the test for continuous distributions as well (like the Kolmogorov-Smirnov ks.test() from the stats package).
Note: this post is concerned only with a fully specified null distribution function, so please no bootstrapping or Monte Carlo simulation here.
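For a fully specified continuous F0 the statistic itself is straightforward via the probability integral transform; what I am missing is the p-value. A rough sketch of the statistic (my own attempt, using pnorm purely as an example null):
cvm_stat_cont <- function(x, F0 = pnorm, ...) {
  n <- length(x)
  u <- F0(sort(x), ...)   # probability integral transform of the order statistics
  1 / (12 * n) + sum((u - (2 * seq_len(n) - 1) / (2 * n))^2)
}
set.seed(1)
cvm_stat_cont(rnorm(50))   # W^2 statistic; small values are consistent with the null
The hard part is turning W^2 into a p-value, which is what the dgof source below does for the discrete case: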
> cvm.test
function (x, y, type = c("W2", "U2", "A2"), simulate.p.value = FALSE,
B = 2000, tol = 1e-08)
{
cvm.pval.disc <- function(STAT, lambda) {
x <- STAT
theta <- function(u) {
VAL <- 0
for (i in 1:length(lambda)) {
VAL <- VAL + 0.5 * atan(lambda[i] * u)
}
return(VAL - 0.5 * x * u)
}
rho <- function(u) {
VAL <- 0
for (i in 1:length(lambda)) {
VAL <- VAL + log(1 + lambda[i]^2 * u^2)
}
VAL <- exp(VAL * 0.25)
return(VAL)
}
fun <- function(u) return(sin(theta(u))/(u * rho(u)))
pval <- 0
try(pval <- 0.5 + integrate(fun, 0, Inf, subdivisions = 1e+06)$value/pi,
silent = TRUE)
if (pval > 0.001)
return(pval)
if (pval <= 0.001) {
df <- sum(lambda != 0)
est1 <- dchisq(STAT/max(lambda), df)
logf <- function(t) {
ans <- -t * STAT
ans <- ans - 0.5 * sum(log(1 - 2 * t * lambda))
return(ans)
}
est2 <- 1
try(est2 <- exp(nlm(logf, 1/(4 * max(lambda)))$minimum),
silent = TRUE)
return(min(est1, est2))
}
}
cvm.stat.disc <- function(x, y, type = c("W2", "U2", "A2")) {
type <- match.arg(type)
I <- knots(y)
N <- length(x)
e <- diff(c(0, N * y(I)))
obs <- rep(0, length(I))
for (j in 1:length(I)) {
obs[j] <- length(which(x == I[j]))
}
S <- cumsum(obs)
T <- cumsum(e)
H <- T/N
p <- e/N
t <- (p + p[c(2:length(p), 1)])/2
Z <- S - T
Zbar <- sum(Z * t)
S0 <- diag(p) - p %*% t(p)
A <- matrix(1, length(p), length(p))
A <- apply(row(A) >= col(A), 2, as.numeric)
E <- diag(t)
One <- rep(1, nrow(E))
K <- diag(0, length(H))
diag(K)[-length(H)] <- 1/(H[-length(H)] * (1 - H[-length(H)]))
Sy <- A %*% S0 %*% t(A)
M <- switch(type, W2 = E, U2 = (diag(1, nrow(E)) - E %*%
One %*% t(One)) %*% E %*% (diag(1, nrow(E)) - One %*%
t(One) %*% E), A2 = E %*% K)
lambda <- eigen(M %*% Sy)$values
STAT <- switch(type, W2 = sum(Z^2 * t)/N, U2 = sum((Z -
Zbar)^2 * t)/N, A2 = sum((Z^2 * t/(H * (1 - H)))[-length(I)])/N)
return(c(STAT, lambda))
}
cvm.pval.disc.sim <- function(STATISTIC, lambda, y, type,
tol, B) {
knots.y <- knots(y)
fknots.y <- y(knots.y)
u <- runif(B * length(x))
u <- sapply(u, function(a) return(knots.y[sum(a > fknots.y) +
1]))
dim(u) <- c(B, length(x))
s <- apply(u, 1, cvm.stat.disc, y, type)
s <- s[1, ]
return(sum(s >= STATISTIC - tol)/B)
}
type <- match.arg(type)
DNAME <- deparse(substitute(x))
if (is.stepfun(y)) {
if (length(setdiff(x, knots(y))) != 0) {
stop("Data are incompatable with null distribution; ",
"Note: This function is meant only for discrete distributions ",
"you may be receiving this error because y is continuous.")
}
tempout <- cvm.stat.disc(x, y, type = type)
STAT <- tempout[1]
lambda <- tempout[2:length(tempout)]
if (!simulate.p.value) {
PVAL <- cvm.pval.disc(STAT, lambda)
}
else {
PVAL <- cvm.pval.disc.sim(STAT, lambda, y, type,
tol, B)
}
METHOD <- paste("Cramer-von Mises -", type)
names(STAT) <- as.character(type)
RVAL <- list(statistic = STAT, p.value = PVAL, alternative = "Two.sided",
method = METHOD, data.name = DNAME)
}
else {
stop("Null distribution must be a discrete.")
}
class(RVAL) <- "htest"
return(RVAL)
}
<environment: namespace:dgof>
The Kolmogorov-Smirnov ks.test() from the stats package, for comparison (note that this function does both the one-sample and two-sample tests):
> ks.test
function (x, y, ..., alternative = c("two.sided", "less", "greater"),
exact = NULL, tol = 1e-08, simulate.p.value = FALSE, B = 2000)
{
pkolmogorov1x <- function(x, n) {
if (x <= 0)
return(0)
if (x >= 1)
return(1)
j <- seq.int(from = 0, to = floor(n * (1 - x)))
1 - x * sum(exp(lchoose(n, j) + (n - j) * log(1 - x -
j/n) + (j - 1) * log(x + j/n)))
}
exact.pval <- function(alternative, STATISTIC, x, n, y, knots.y,
tol) {
ts.pval <- function(S, x, n, y, knots.y, tol) {
f_n <- ecdf(x)
eps <- min(tol, min(diff(knots.y)) * tol)
eps2 <- min(tol, min(diff(y(knots.y))) * tol)
a <- rep(0, n)
b <- a
f_a <- a
for (i in 1:n) {
a[i] <- min(c(knots.y[which(y(knots.y) + S >=
i/n + eps2)[1]], Inf), na.rm = TRUE)
b[i] <- min(c(knots.y[which(y(knots.y) - S >
(i - 1)/n - eps2)[1]], Inf), na.rm = TRUE)
f_a[i] <- ifelse(!(a[i] %in% knots.y), y(a[i]),
y(a[i] - eps))
}
f_b <- y(b)
p <- rep(1, n + 1)
for (i in 1:n) {
tmp <- 0
for (k in 0:(i - 1)) {
tmp <- tmp + choose(i, k) * (-1)^(i - k - 1) *
max(f_b[k + 1] - f_a[i], 0)^(i - k) * p[k +
1]
}
p[i + 1] <- tmp
}
p <- max(0, 1 - p[n + 1])
if (p > 1) {
warning("numerical instability in p-value calculation.")
p <- 1
}
return(p)
}
less.pval <- function(S, n, H, z, tol) {
m <- ceiling(n * (1 - S))
c <- S + (1:m - 1)/n
CDFVAL <- H(sort(z))
for (j in 1:length(c)) {
ifelse((min(abs(c[j] - CDFVAL)) < tol), c[j] <- 1 -
c[j], c[j] <- 1 - CDFVAL[which(order(c(c[j],
CDFVAL)) == 1)])
}
b <- rep(0, m)
b[1] <- 1
for (k in 1:(m - 1)) b[k + 1] <- 1 - sum(choose(k,
1:k - 1) * c[1:k]^(k - 1:k + 1) * b[1:k])
p <- sum(choose(n, 0:(m - 1)) * c^(n - 0:(m - 1)) *
b)
return(p)
}
greater.pval <- function(S, n, H, z, tol) {
m <- ceiling(n * (1 - S))
c <- 1 - (S + (1:m - 1)/n)
CDFVAL <- c(0, H(sort(z)))
for (j in 1:length(c)) {
if (!(min(abs(c[j] - CDFVAL)) < tol))
c[j] <- CDFVAL[which(order(c(c[j], CDFVAL)) ==
1) - 1]
}
b <- rep(0, m)
b[1] <- 1
for (k in 1:(m - 1)) b[k + 1] <- 1 - sum(choose(k,
1:k - 1) * c[1:k]^(k - 1:k + 1) * b[1:k])
p <- sum(choose(n, 0:(m - 1)) * c^(n - 0:(m - 1)) *
b)
return(p)
}
p <- switch(alternative, two.sided = ts.pval(STATISTIC,
x, n, y, knots.y, tol), less = less.pval(STATISTIC,
n, y, knots.y, tol), greater = greater.pval(STATISTIC,
n, y, knots.y, tol))
return(p)
}
sim.pval <- function(alternative, STATISTIC, x, n, y, knots.y,
tol, B) {
fknots.y <- y(knots.y)
u <- runif(B * length(x))
u <- sapply(u, function(a) return(knots.y[sum(a > fknots.y) +
1]))
dim(u) <- c(B, length(x))
getks <- function(a, knots.y, fknots.y) {
dev <- c(0, ecdf(a)(knots.y) - fknots.y)
STATISTIC <- switch(alternative, two.sided = max(abs(dev)),
greater = max(dev), less = max(-dev))
return(STATISTIC)
}
s <- apply(u, 1, getks, knots.y, fknots.y)
return(sum(s >= STATISTIC - tol)/B)
}
alternative <- match.arg(alternative)
DNAME <- deparse(substitute(x))
x <- x[!is.na(x)]
n <- length(x)
if (n < 1L)
stop("not enough 'x' data")
PVAL <- NULL
if (is.numeric(y)) {
DNAME <- paste(DNAME, "and", deparse(substitute(y)))
y <- y[!is.na(y)]
n.x <- as.double(n)
n.y <- length(y)
if (n.y < 1L)
stop("not enough 'y' data")
if (is.null(exact))
exact <- (n.x * n.y < 10000)
METHOD <- "Two-sample Kolmogorov-Smirnov test"
TIES <- FALSE
n <- n.x * n.y/(n.x + n.y)
w <- c(x, y)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
if (length(unique(w)) < (n.x + n.y)) {
warning("cannot compute correct p-values with ties")
z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]
TIES <- TRUE
}
STATISTIC <- switch(alternative, two.sided = max(abs(z)),
greater = max(z), less = -min(z))
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below that of y", greater = "the CDF of x lies above that of y")
if (exact && (alternative == "two.sided") && !TIES)
PVAL <- 1 - .C("psmirnov2x", p = as.double(STATISTIC),
as.integer(n.x), as.integer(n.y), PACKAGE = "dgof")$p
}
else if (is.stepfun(y)) {
z <- knots(y)
if (is.null(exact))
exact <- (n <= 30)
if (exact && n > 30) {
warning("numerical instability may affect p-value")
}
METHOD <- "One-sample Kolmogorov-Smirnov test"
dev <- c(0, ecdf(x)(z) - y(z))
STATISTIC <- switch(alternative, two.sided = max(abs(dev)),
greater = max(dev), less = max(-dev))
if (simulate.p.value) {
PVAL <- sim.pval(alternative, STATISTIC, x, n, y,
z, tol, B)
}
else {
PVAL <- switch(exact, `TRUE` = exact.pval(alternative,
STATISTIC, x, n, y, z, tol), `FALSE` = NULL)
}
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below the null hypothesis",
greater = "the CDF of x lies above the null hypothesis")
}
else {
if (is.character(y))
y <- get(y, mode = "function")
if (mode(y) != "function")
stop("'y' must be numeric or a string naming a valid function")
if (is.null(exact))
exact <- (n < 100)
METHOD <- "One-sample Kolmogorov-Smirnov test"
TIES <- FALSE
if (length(unique(x)) < n) {
warning(paste("default ks.test() cannot compute correct p-values with ties;\n",
"see help page for one-sample Kolmogorov test for discrete distributions."))
TIES <- TRUE
}
x <- y(sort(x), ...) - (0:(n - 1))/n
STATISTIC <- switch(alternative, two.sided = max(c(x,
1/n - x)), greater = max(1/n - x), less = max(x))
if (exact && !TIES) {
PVAL <- if (alternative == "two.sided")
1 - .C("pkolmogorov2x", p = as.double(STATISTIC),
as.integer(n), PACKAGE = "dgof")$p
else 1 - pkolmogorov1x(STATISTIC, n)
}
nm_alternative <- switch(alternative, two.sided = "two-sided",
less = "the CDF of x lies below the null hypothesis",
greater = "the CDF of x lies above the null hypothesis")
}
names(STATISTIC) <- switch(alternative, two.sided = "D",
greater = "D^+", less = "D^-")
pkstwo <- function(x, tol = 1e-06) {
if (is.numeric(x))
x <- as.vector(x)
else stop("argument 'x' must be numeric")
p <- rep(0, length(x))
p[is.na(x)] <- NA
IND <- which(!is.na(x) & (x > 0))
if (length(IND)) {
p[IND] <- .C("pkstwo", as.integer(length(x[IND])),
p = as.double(x[IND]), as.double(tol), PACKAGE = "dgof")$p
}
return(p)
}
if (is.null(PVAL)) {
PVAL <- ifelse(alternative == "two.sided", 1 - pkstwo(sqrt(n) *
STATISTIC), exp(-2 * n * STATISTIC^2))
}
RVAL <- list(statistic = STATISTIC, p.value = PVAL, alternative = nm_alternative,
method = METHOD, data.name = DNAME)
class(RVAL) <- "htest"
return(RVAL)
}
<environment: namespace:dgof>
