Why does uniroot() work perfectly fine when I supply the value of q directly (see f2), but fail when I instead supply q as a function of other input values (see f1)?
In the code, everything with the suffix 1 (e.g., f1) relates to supplying q indirectly, and everything with the suffix 2 (e.g., f2) relates to supplying q directly.
My goal is to solve for df2 such that y = .15 (the correct answer is about 336.3956). Please run the entire code below.
alpha = c(.025, .975); df1 = 3; q = 48.05649 ; peta = .3 # input values
f1 <- function(alpha, q, df1, df2, ncp){ # Objective function (`q` indirectly)
alpha - suppressWarnings(pf(q = (peta / df1) / ((1 - peta)/df2), df1, df2,
ncp, lower.tail = FALSE))
}
f2 <- function(alpha, q, df1, df2, ncp){ # Objective function (`q` directly)
alpha - suppressWarnings(pf(q = q, df1, df2, ncp, lower.tail = FALSE))
}
ncp1 <- function(df2){ # root finding
b <- sapply(c(alpha[1], alpha[2]),
function(x) uniroot(f1, c(0, 1e7), alpha = x, q = peta, df1 = df1, df2 = df2)[[1]])
b / (b + (df2 + 4))
}
ncp2 <- function(df2){ # root finding
b <- sapply(c(alpha[1], alpha[2]),
function(x) uniroot(f2, c(0, 1e7), alpha = x, q = q, df1 = df1, df2 = df2)[[1]])
b / (b + (df2 + 4))
}
m1 <- function(df2, y){ # A Utility function
abs(abs(diff(ncp1(df2))) - y)
}
m2 <- function(df2, y){ # A Utility function
abs(abs(diff(ncp2(df2))) - y)
}
optimize(m1, c(1, 1e7), y = .15)[[1]] # Incorrect answer: 1e+07
optimize(m2, c(1, 1e7), y = .15)[[1]] # Correct answer: 336.3956
In ncp1
you have q = peta, which gets passed on to f1 but is never actually used, because pf takes q as (peta / df1) / ((1 - peta)/df2).
In ncp2
you have q = q, which gets passed on to f2 and in turn to pf.
So the bottom line is that you use different values for q in pf. If you re-activate warnings, you will see that f1 as part of ncp1 fails to reach convergence.
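The difference in q can be checked with a little arithmetic. This is a minimal sketch using the question's input values; q_indirect is a name introduced here for illustration only. It shows that the q computed inside f1 changes with df2 and matches the fixed q = 48.05649 only near the target df2.

```r
peta <- .3; df1 <- 3; q_fixed <- 48.05649   # input values from the question

# q as f1 computes it internally; `q_indirect` is a hypothetical helper name
q_indirect <- function(df2) (peta / df1) / ((1 - peta) / df2)

q_indirect(c(10, 100, 336.3956, 1000))      # varies with df2
abs(q_indirect(336.3956) - q_fixed) < 1e-3  # the two agree only near the solution
```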
I'm fitting linear models with MatrixModels:::lm.fit.sparse and MatrixModels::glm4 (also sparse).
However, these functions return coeff, residuals and fitted.values only.
What's the fastest and easiest way to get or calculate other values such as standard errors, t-values, p-values and predicted values?
I use the data from MatrixModels:::lm.fit.sparse example.
I built a custom function summary_sparse to perform a summary for this model.
All matrix operations are performed with Matrix package.
Results are compared with dense type model.
Note that lm.fit.sparse has to be evaluated with method = "chol" to get proper results.
Functions:
summary_sparse <- function(l, X) {
XXinv <- Matrix::chol2inv(Matrix::chol(Matrix::crossprod(X)))
se <- sqrt(Matrix::diag(XXinv*sum(l$residuals**2)/(nrow(X)-ncol(X))))
ts <- l$coef/se
pvals <- 2*c(pt(abs(ts), df = nrow(X) - ncol(X), lower.tail = FALSE)) # t distribution with residual df, as summary.lm uses
list(coef = l$coef, se = se, t = ts, p = pvals)
}
predict_sparse <- function(X, coef) {
X %*% coef
}
Application:
dd <- expand.grid(a = as.factor(1:3),
b = as.factor(1:4),
c = as.factor(1:2),
d= as.factor(1:8))
n <- nrow(dd <- dd[rep(seq_len(nrow(dd)), each = 10), ])
set.seed(17)
dM <- cbind(dd, x = round(rnorm(n), 1))
## randomly drop some
n <- nrow(dM <- dM[- sample(n, 50),])
dM <- within(dM, { A <- c(2,5,10)[a]
B <- c(-10,-1, 3:4)[b]
C <- c(-8,8)[c]
D <- c(10*(-5:-2), 20*c(0, 3:5))[d]
Y <- A + B + A*B + C + D + A*D + C*x + rnorm(n)/10
wts <- sample(1:10, n, replace=TRUE)
rm(A,B,C,D)
})
X <- Matrix::sparse.model.matrix( ~ (a+b+c+d)^2 + c*x, data = dM)
Xd <- as(X,"matrix")
fmDense <- lm(dM[,"Y"]~Xd-1)
ss <- summary(fmDense)
r1 <- MatrixModels:::lm.fit.sparse(X, y = dM[,"Y"], method = "chol")
f <- summary_sparse(r1, X)
all.equal(do.call(cbind, f), ss$coefficients, check.attributes = F)
#TRUE
all.equal(predict_sparse(X, r1$coef)@x, predict(fmDense), check.attributes = F, check.names=F)
#TRUE
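As a smaller cross-check of the same formulas, here is a hedged sketch on made-up data, assuming only that the Matrix package is installed. The data frame d and all variable names are illustrative, and the coefficients come from the normal equations rather than from lm.fit.sparse. It compares the hand-computed standard errors, t-values and p-values (taken from the t distribution with the residual degrees of freedom) against summary(lm()).

```r
library(Matrix)

set.seed(1)
n <- 200
# Made-up data: one factor, one numeric predictor (illustrative only)
d <- data.frame(g = factor(sample(letters[1:4], n, replace = TRUE)), x = rnorm(n))
d$y <- 1 + as.integer(d$g) + 2 * d$x + rnorm(n)

X      <- sparse.model.matrix(~ g + x, data = d)
XtX    <- crossprod(X)
beta   <- as.numeric(solve(XtX, crossprod(X, d$y)))   # normal equations
res    <- d$y - as.numeric(X %*% beta)
df_res <- nrow(X) - ncol(X)

se   <- sqrt(diag(chol2inv(chol(XtX))) * sum(res^2) / df_res)
tval <- beta / se
pval <- 2 * pt(abs(tval), df = df_res, lower.tail = FALSE)

ref <- coef(summary(lm(y ~ g + x, data = d)))          # dense reference
max(abs(cbind(beta, se, tval, pval) - ref))            # should be tiny
```

Solving the normal equations is fine for a toy check like this one; for ill-conditioned designs a QR-based fit is usually the safer choice.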
I have written the following code.
library(quantreg)
# return the g function:
G = function(m, N, gamma) {
Tm = m * N
k = 1:Tm
Gvalue = sqrt(m) * (1 + k/m) * (k/(m + k))^gamma
return(Gvalue)
}
sqroot <- function(A) {
e = eigen(A)
v = e$vectors
val = e$values
sq = v %*% diag(sqrt(val)) %*% solve(v)
return(t(sq))
}
fa = function(m, N, a) {
Tm = m * N
k = 1:Tm
t = (m + k)/m
f_value = (t - 1) * t * (a^2 + log(t/(t - 1)))
return(sqrt(f_value))
}
m = 50
N = 2
n= 50*3
x1 = matrix(runif(n, 0, 1), ncol = 1)
x = cbind(1, x1)
beta = c(1, 1)
xb = x %*% beta
pr = 1/(1+exp(-xb))
y = rbinom(n,1,pr)
# calculate statistic:
stat = function(y, x, m, N, a) {
y_train = y[1:m]
x_train = x[(1:m),]
y_test = y[-(1:m)]
x_test = x[-(1:m),]
fit = glm(y ~ 0 + x, family="binomial")
coef = coef(fit)
log_predict = predict(fit, type="response")
sigma = sqrt(1/(m-1)* sum((y_train - log_predict)^2))
Jvalue = t(x_train) %*% x_train/m * sigma^2
Jsroot = sqroot(Jvalue)
fvalue = fa(m, N, a)
score1 = apply((x_test * as.vector((y_test - x_test %*% coef))), 2, cumsum)
statvalue1 = t(solve(Jsroot) %*% t(score1))/fvalue/sqrt(m)
statmax1 = pmax(abs(statvalue1[, 1]), abs(statvalue1[, 2]))
result = list(stat = statmax1)
return(result)
}
m =50
N = 2
a = 2.795
value = stat(y, x, m, N, a)
value
I want to bootstrap this to obtain B = 999 statistics. I use the following R code, but it produces this error:
Error in statistic(data, original, ...) : argument "m" is missing, with no default
library(boot)
data1 = data.frame(y = y, x = x1, m = m , N = N, a = a)
head(data1)
boot_value = boot(data1, statistic = stat, R = 999)
Can anyone give me a hint? Also, can I get the bootstrap results in matrix format, given that the stat function returns 100 values?
There are different kinds of bootstrapping. If you want to draw 999 samples from your data with replacement, each the same size as your data, you can just use replicate; no package is needed.
We put the data to be resampled into a data frame. It looks to me like m, N and a remain constant, so we just pass them as separate arguments.
data2 <- data.frame(y=y, x=x)
The stat function needs to be adapted to unpack y and the x matrix from the data frame. At the bottom we drop the list call so that just a vector is returned; unname strips the names, leaving just the numbers.
stat2 <- function(data, m, N, a) {
y_train <- data[1:m, 1]
x_train <- as.matrix(data[1:m, 2:3])
y_test <- data[-(1:m), 1]
x_test <- as.matrix(data[-(1:m), 2:3])
y <- data[, "y"]
x <- as.matrix(data[, 2:3])
fit <- glm(y ~ 0 + x, family="binomial")
coef <- coef(fit)
log_predict <- predict(fit, type="response")
sigma <- sqrt(1/(m-1) * sum((y_train - log_predict)^2))
Jvalue <- t(x_train) %*% x_train/m * sigma^2
Jsroot <- sqroot(Jvalue)
fvalue <- fa(m, N, a)
score1 <- apply((x_test * as.vector((y_test - x_test %*% coef))), 2, cumsum)
statvalue1 <- t(solve(Jsroot) %*% t(score1))/fvalue/sqrt(m)
statmax1 <- pmax(abs(statvalue1[, 1]), abs(statvalue1[, 2]))
result <- unname(statmax1)
return(result)
}
replicate is a cousin of sapply, designed for repeated evaluation. In the call we just resample the rows 999 times and get a matrix back directly. As with sapply, we need to transpose the result.
res <- t(replicate(999, stat2(data2[sample(1:nrow(data2), nrow(data2), replace=TRUE), ], m, N, a)))
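Why the transpose is needed can be seen on a toy call. This is a small sketch with made-up numbers and the illustrative name toy, not the statistic from the question: replicate stores each replication as a column, so t() turns replications into rows.

```r
set.seed(1)
toy <- replicate(4, sort(runif(3)))  # each evaluation returns a length-3 vector
dim(toy)     # 3 x 4: one column per replication
dim(t(toy))  # 4 x 3: one row per replication
```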
Result
As a result we get 999 bootstrap replications in the rows and the 100 statistic values in the columns.
str(res)
# num [1:999, 1:100] 0.00205 0.38486 0.10146 0.12726 0.47056 ...
The code also runs quite fast.
user system elapsed
3.46 0.01 3.49
Note that there are different kinds of bootstrapping: sometimes just a part of the sample is resampled, weights are used, clustering is applied, etc. Since you attempted to use boot, the method shown should be the default, though.
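If you would rather stay with boot, the error in the question arises because boot calls the statistic as statistic(data, indices, ...): the second argument must be the vector of resampled row indices, and constants are passed through the dots. Below is a minimal sketch with a toy statistic; toy, toy_stat and k are made up for illustration and are not the stat function from the question.

```r
library(boot)

set.seed(42)
toy <- data.frame(y = rnorm(30), x = runif(30))   # made-up data

# boot supplies `data` and the resampled row indices `idx`;
# extra constants such as `k` are forwarded from boot() via ...
toy_stat <- function(data, idx, k) {
  d <- data[idx, ]
  cumsum(d$y * d$x)[seq_len(k)]                   # returns a length-k vector
}

bt <- boot(toy, statistic = toy_stat, R = 99, k = 5)
dim(bt$t)   # R rows, one column per element of the statistic
```

The replications are stored in the t component as a matrix, which also covers the request for results in matrix format.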
In the last line of my R code below, I use optimize() to find the df2 that minimizes the ncp_diff function.
However, I was wondering whether I could use uniroot() instead of optimize() for this minimization?
alpha = c(.025, .975); df1 = 3; peta = .3 # The input
f <- function(alpha, q, df1, df2, ncp){ # Notice `ncp` is the unknown
alpha - suppressWarnings(pf(q = (peta / df1) / ((1 - peta)/df2), df1, df2, ncp, lower = FALSE))
}
ncp <- function(df2){ # Root finding: finds 2 `ncp` for a given `df2`
b <- sapply(c(alpha[1], alpha[2]),
function(x) uniroot(f, c(0, 1e7), alpha = x, q = peta, df1 = df1, df2 = df2)[[1]])
b / (b + (df2 + 4))
}
ncp_diff <- function(df2, target = 0.15){
the_ncp <- ncp(df2)
return(abs(abs(the_ncp[2] - the_ncp[1]) - target))
}
optimize(ncp_diff, c(0, 1000)) ## HERE can I use `uniroot()` instead of `optimize()`
alpha = c(.025, .975); df1 = 3; peta = .3 # The input
f <- function(alpha, q, df1, df2, ncp){ # Notice `ncp` is the unknown
alpha - suppressWarnings(pf(q = (peta / df1) / ((1 - peta)/df2), df1, df2, ncp, lower = FALSE))
}
ncp <- function(df2){ # Root finding: finds 2 `ncp` for a given `df2`
b <- sapply(c(alpha[1], alpha[2]),
function(x) uniroot(f, c(0, 1e7), alpha = x, q = peta, df1 = df1, df2 = df2)[[1]])
b / (b + (df2 + 4))
}
ncp_diff <- function(df2, target = 0.15){
the_ncp <- ncp(df2)
return((the_ncp[2] - the_ncp[1]) - target)
}
uniroot(ncp_diff, c(100, 1000)) #
$root
[1] 336.3956
$f.root
[1] 3.74663e-09
$iter
[1] 7
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05
Edit:
In order to use the same interval of (0, 1000), we need a way around the case where the function values at the lower and upper ends of the interval have the same sign. uniroot() raises an error in that case, so we can catch it with tryCatch:
ncp <- function(df2){ # Root finding: finds 2 `ncp` for a given `df2`
b <- sapply(c(alpha[1], alpha[2]),
function(x)
tryCatch(uniroot(f, c(0, 1e7), alpha = x, q = peta, df1 = df1, df2 = df2)[[1]],
error =function(e)NA ))
if(any(is.na(b)))b= c(1,10)
b / (b + (df2 + 4))
}
uniroot(ncp_diff, c(0, 1000)) #
$root
[1] 336.3956
$f.root
[1] -2.132438e-09
$iter
[1] 8
$init.it
[1] NA
$estim.prec
[1] 6.103516e-05
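The sign requirement behind this workaround can be seen on a toy function. This is a minimal sketch; g and safe_root are names made up for illustration. uniroot() needs function values of opposite sign at the two ends of the interval, and tryCatch turns the resulting error into NA.

```r
g <- function(x) x^2 - 4   # toy function with a root at x = 2

# hypothetical wrapper: return NA instead of stopping on an error
safe_root <- function(f, interval) {
  tryCatch(uniroot(f, interval)$root, error = function(e) NA_real_)
}

r_ok  <- safe_root(g, c(0, 5))  # g(0) < 0 < g(5): a root is bracketed
r_bad <- safe_root(g, c(3, 5))  # g(3) and g(5) are both positive: NA
```

This is also why the abs() calls had to go from ncp_diff: a function that is never negative cannot change sign across the interval.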
Below, I'm trying to solve for ncp (there is one answer). But I'm wondering why the answer changes drastically when I extend the interval argument in optimize?
Could I use uniroot instead of optimize here?
f <- function(pwr, q, df1, df2, ncp){
abs(pwr - pf(q, df1, df2, ncp, lower.tail = FALSE))
}
optimize(f, interval = c(0, 1e2), pwr = .8, q = 2.5, df1 = 3, df2 = 108)[[1]]
# [1] 10.54639 !!! HERE
optimize(f, interval = c(0, 5e2), pwr = .8, q = 2.5, df1 = 3, df2 = 108)[[1]]
# [1] 499.9999 !!! HERE
Because the rightmost part of the curve is too flat - all values beyond 150 are identical.
Utility function:
f2 <- function(x) f(x, pwr = .8, q = 2.5, df1 = 3, df2 = 108)
cc <- curve(f2(x)-0.2,from=150,to=500)
unique(cc$y)
## [1] -5.551115e-17
uniroot() does indeed work fine: we have to change the function f to return a signed value.
f <- function(pwr, q, df1, df2, ncp){
pwr - pf(q, df1, df2, ncp, lower.tail = FALSE)
}
uniroot(f, interval = c(0, 5e2), pwr = .8, q = 2.5, df1 = 3, df2 = 108)
## $root
## [1] 10.54641
## $f.root
## [1] -3.806001e-08
## etc.
In general, converting root-finding problems to minimum-finding problems by squaring or taking the absolute value is a fragile strategy (I read about this in Numerical Recipes years ago ...)
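The same contrast can be reproduced on a simpler curve. This is a hedged sketch with a made-up function (signed and root_est are illustrative names): the signed version crosses zero once, while its absolute value is flat at 0.2 for large x, which is the kind of plateau that can mislead a minimizer.

```r
# Toy analogue: solve 1 - exp(-x) = 0.8, whose exact root is -log(0.2)
signed <- function(x) 0.8 - (1 - exp(-x))

# Root finding on the signed function over a wide interval
root_est <- uniroot(signed, interval = c(0, 500))$root
c(estimate = root_est, exact = -log(0.2))

# Minimizing the absolute value over the same interval may stall on the plateau
optimize(function(x) abs(signed(x)), interval = c(0, 500))$minimum
```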
Starting with some sample data:
sample_data <- data.frame(id = 1:3,
x = c(128, 113, 126),
n = c(347, 344, 347),
m = c(335, 334, 347),
index = c(11, 9, -1))
theta <- matrix(c(0.5 ,0.5, 2, 2), nrow=2, ncol=2)
lhs <- function(a, b, g, d, dat){
beta(a + dat$x, b + dat$n - dat$x) / beta(a, b) * beta(g, d + dat$n) / beta(g, d)
}
The function lhs returns a vector of the same number of rows as the argument dat.
rhs <- function(dat, ...){
n = dat$n
m = dat$m
x = dat$x
index = dat$x
temp <- data.frame(i = 0:index,
n = rep(n, index + 1) ,
m = rep(m, index + 1) ,
x = rep(x, index + 1))
sum(beta(a + temp$x, b + temp$m - temp$x + temp$i) / beta(a,b) * beta(g + 1, d + temp$m + temp$i) / beta(g, d))
}
The function rhs works only on a single row because each observation has a different value for index (the index for the sum inside rhs). The intent is to return one value per row in dat. I've tried to do this with apply in the function LL (below).
LL <- function(theta, dat){
a <- theta[1,1]
b <- theta[2,1]
g <- theta[1,2]
d <- theta[2,2]
.lhs <- lhs(a, b, g, d, dat)
.rhs <- ifelse(index > -1, apply(dat, 1, rhs), 0)
sum(log(.lhs+log.rhs))
}
It seems like I need to be able to pass the values of index, n, m, and x from a given row to rhs. That is, not a vector of length(dat$n), but the value of n at that row being passed through apply in the function LL.
Is this the correct approach? How can I do such a thing?
Thanks.
Edit
I've cleaned things up a bit and made a slight modification to the sample data. The correct return value (I think!) can be arrived at by explicitly passing a, b, g, and d:
sum(lhs(a = theta[1,1],
b = theta[2,1],
g = theta[1,2],
d = theta[2,2],
sample_data)) +
rhs(sample_data[1,]) +
rhs(sample_data[2,]) +
rhs(sample_data[3,])
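One possible way to get one value per row is to loop over row numbers and hand each one-row data frame to the function, which keeps the $ access working. This is only a sketch on toy data with a hypothetical per_row function standing in for rhs. Note that apply(dat, 1, ...) passes each row as a plain vector instead, so dat$n inside the function would fail.

```r
toy <- data.frame(n = c(3, 4, 5), index = c(2, 0, -1))   # made-up data

# hypothetical stand-in for rhs: sums n + i over i = 0..index,
# returning 0 when index is negative
per_row <- function(row) {
  if (row$index < 0) return(0)
  i <- 0:row$index
  sum(row$n + i)
}

# one call per row; toy[k, ] is still a data frame, so row$n works
out <- vapply(seq_len(nrow(toy)), function(k) per_row(toy[k, ]), numeric(1))
out
```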