Hamming distance measure for pvclust in R

I am trying to create a Hamming distance measure for the pvclust clustering method. (There isn't one defined for this function.) I based it on the example given for the cosine measure:
cosine <- function(x) {
  x <- as.matrix(x)
  y <- t(x) %*% x
  res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y))))
  res <- as.dist(res)
  attr(res, "method") <- "cosine"
  return(res)
}
I tried to do it this way:
hamming <- function(x) {
  x <- as.matrix(x)
  y <- t(x) %*% x
  res <- sum(y != y)
  res <- as.dist(res)
  attr(res, "method") <- "hamming"
  return(res)
}
Unfortunately it doesn't work properly. Can anyone tell me where the error is and how to fix it?

Try this
hamming <- function(x) {
  x <- as.matrix(x)
  y <- (1 - x) %*% t(x)
  res <- y + t(y)
  res <- as.dist(res)
  attr(res, "method") <- "hamming"
  return(res)
}
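A quick sanity check on a small 0/1 matrix (hypothetical data): each entry of the result should equal the number of positions at which the corresponding pair of rows differs. Note that, unlike the cosine example (which works column-wise via t(x) %*% x), this version computes distances between the rows of x, so transpose the input if you need distances between columns:
m <- matrix(c(0, 1, 1,
              1, 1, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)
hamming(m)
sum(m[1, ] != m[2, ])  # direct count for rows 1 and 2: 2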

Related

Estimating risk difference using delta approach

I want to estimate the confidence interval using the delta approach; an image of example output from the delta approach is attached. The intervals are very wide compared to the MLE or bootstrap estimates. Can anyone suggest a way to derive accurate CIs using the delta approach? Thank you.
Mod1 <- Mod0 <- model.matrix(object)[subset, ]
n <- nrow(Mod0)
Nvec <- matrix(rep(c(1/n, 0, 0, 1/n), each = n), n*2, 2)
if (method == 'delta') {
  if (scale == 'linear') {
    df <- deriv( ~y/x, c('x', 'y'))  # y depends on x
  }
  out <- sapply(parm, function(p) {
    Mod0[, p] <- 0
    Mod1[, p] <- 1
    Mod <- rbind(Mod0, Mod1)
    allpreds <- family(object)$linkinv(Mod %*% coef(object))
    avgpreds <- t(Nvec) %*% allpreds
    val <- f(avgpreds)
    V <- sweep(chol(vcov(object)) %*% t(Mod), allpreds*(1-allpreds), '*', MARGIN = 2) %*% Nvec
    V <- t(V) %*% V
    dxdy <- matrix(attr(eval(df, list('x' = avgpreds[1], 'y' = avgpreds[2])), 'gradient'))
    se <- sqrt(t(dxdy) %*% V %*% dxdy)
    out <- c(val, se, z <- abs({val-null}/se), 2*pnorm(abs(val/se), lower.tail = FALSE),
             val + qnorm(cilevel[1])*se, val + qnorm(cilevel[2])*se)
    names(out) <- c(name, 'Std. Error', 'Z-value', 'p-value', ciname)
    out
  })
  out <- t(out)
  rownames(out) <- names(cf)
  return(out)
} ## end if delta
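For comparison, here is a minimal, self-contained delta-method sketch for an average risk difference p1 - p0 from a logistic model. It is a hypothetical example on mtcars, not the code above; the model and covariate profiles are made up purely for illustration:
fit <- glm(am ~ wt, data = mtcars, family = binomial)
X0 <- X1 <- model.matrix(fit)
X0[, "wt"] <- quantile(mtcars$wt, 0.25)  # covariate profile 0 (hypothetical)
X1[, "wt"] <- quantile(mtcars$wt, 0.75)  # covariate profile 1 (hypothetical)
p0 <- plogis(drop(X0 %*% coef(fit)))
p1 <- plogis(drop(X1 %*% coef(fit)))
rd <- mean(p1) - mean(p0)                # average risk difference
# Gradient of each average prediction w.r.t. beta (chain rule; logit link):
g0 <- colMeans(p0 * (1 - p0) * X0)
g1 <- colMeans(p1 * (1 - p1) * X1)
grad <- g1 - g0
se <- sqrt(drop(t(grad) %*% vcov(fit) %*% grad))
rd + c(-1, 1) * qnorm(0.975) * se        # 95% Wald CI
If the delta intervals are much wider than the bootstrap ones, it can help to compute the interval on the link scale and back-transform, since the sampling distribution is often less skewed there.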

What is going on inside the varimax function in R?

I have been trying to figure out the core part of the varimax function in R. I found a wiki link that writes out the algorithm, but why is B <- t(x) %*% (z^3 - z %*% diag(drop(rep(1, p) %*% z^2))/p) computed? I am also not sure why the SVD of the matrix B is taken. The iteration step is presumably there to maximize the variance criterion, and the singular values might then be the variances of the principal components, but I am unsure about that too. I am pasting the whole code of varimax for convenience, but the relevant part, and therefore my question about what is actually happening under the hood, is the body of the for loop.
function (x, normalize = TRUE, eps = 1e-05)
{
    nc <- ncol(x)
    if (nc < 2)
        return(x)
    if (normalize) {
        sc <- sqrt(drop(apply(x, 1L, function(x) sum(x^2))))
        x <- x/sc
    }
    p <- nrow(x)
    TT <- diag(nc)
    d <- 0
    for (i in 1L:1000L) {
        z <- x %*% TT
        B <- t(x) %*% (z^3 - z %*% diag(drop(rep(1, p) %*% z^2))/p)
        sB <- La.svd(B)
        TT <- sB$u %*% sB$vt
        dpast <- d
        d <- sum(sB$d)
        if (d < dpast * (1 + eps))
            break
    }
    z <- x %*% TT
    if (normalize)
        z <- z * sc
    dimnames(z) <- dimnames(x)
    class(z) <- "loadings"
    list(loadings = z, rotmat = TT)
}
Edit: The algorithm is given in the book "Factor Analysis of Data Matrices" by Paul Horst (published by Holt, Rinehart and Winston), and the original sources can be found therein. This book is also cited in the R documentation for varimax.
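One way to see what the loop is doing: B is (up to a constant) the gradient of the varimax criterion with respect to the rotation matrix, and the SVD step is an orthogonal Procrustes projection. Over orthogonal matrices T, the trace tr(T'B) is maximized at T = U V' (where B = U D V'), and the maximum value is the sum of the singular values, which is exactly the quantity d that the loop tracks for convergence. A small sketch:
set.seed(1)
B <- matrix(rnorm(9), 3, 3)  # stand-in for the gradient matrix
sB <- La.svd(B)
TT <- sB$u %*% sB$vt         # nearest orthogonal matrix (Procrustes solution)
sum(diag(t(TT) %*% B))       # equals ...
sum(sB$d)                    # ... the sum of the singular values
So the singular values are not the variances of principal components; their sum serves as the convergence measure (the maximized linearized objective), which increases until its relative change falls below eps.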

Minimum sample size n such that the difference is no more than 1/100

What is the minimum sample size n (that is, the length n = length(x) of the data vector x) such that the difference D = 1 - statx4(x)/statx5(x) between the functions statx4 and statx5 is no more than 1/100, i.e. D ≤ 1/100?
And here are the functions:
statx4 <- function(x) {
  numerator <- sum((x - mean(x))^2)
  denominator <- length(x)
  result <- numerator/denominator
  return(result)
}
statx5 <- function(x) {
  numerator <- sum((x - mean(x))^2)
  denominator <- length(x) - 1
  result <- numerator/denominator
  return(result)
}
I've been working through this exercise set for a while, but haven't managed to get anything valid for this question. Could you point me in the right direction?
For the normal distribution, it can be checked as follows:
statx4 <- function(x) {
  numerator <- sum((x - mean(x))^2)
  denominator <- length(x)
  result <- numerator/denominator
  return(result)
}
statx5 <- function(x) {
  numerator <- sum((x - mean(x))^2)
  denominator <- length(x) - 1
  result <- numerator/denominator
  return(result)
}
D <- function(x) {
  1 - statx4(x)/statx5(x)
}
DD <- function(N = 1111, seed = 1) {
  set.seed(seed)
  Logi <- vector()
  for (n in 1:N) {
    x <- rnorm(n)
    y <- D(x)
    Logi[n] <- (y > 1/100)
  }
  return(Logi)
}
min <- vector()
for (seed in 1:100) {
  message(seed)
  min[seed] <- sum(DD(1000, seed))  # count of n with D > 1/100 for this seed
}
Answer <- mean(min) + 1
Answer
(The original loop called DD(1000, seed) but discarded the result and then counted DD(1000), which always uses the default seed = 1; passing the seed through fixes that.)
Note that the function D evaluates the relative difference between the unbiased variance and the ordinary (maximum-likelihood) variance. I think this problem is clearer in a mathematical sense: statx4(x)/statx5(x) = (n-1)/n for any data, so D = 1/n exactly, and D ≤ 1/100 holds precisely when n ≥ 100.
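A quick check that the ratio is deterministic, so no simulation is actually needed:
x <- rnorm(100)
1 - statx4(x)/statx5(x)  # exactly 1/100, whatever the data
1/length(x)              # the same value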
I got the solutions today, and all you had to do was try values:
a <- rnorm(99); 1 - statx4(a)/statx5(a)
a <- rnorm(100); 1 - statx4(a)/statx5(a)
a <- rnorm(101); 1 - statx4(a)/statx5(a)
The correct answer is 100.
Thank you all for the help.

I am trying to draw a Q-Q plot for the Burr X distribution. The coding looks OK. Why can't I draw it?

This is my program for a Q-Q plot of the Burr type X distribution. I thought the coding was right, but I don't understand why I can't get the plot to run.
burrx.loglike <- function(params, x)
{
  theta <- params[1]
  sigma <- params[2]
  n <- length(x)
  if (theta <= 0 || sigma <= 0)
  {
    ans <- -Inf
  }
  else
  {
    ans <- (n*log(2) + sum(log(x)) + n*log(theta) - 2*n*log(sigma)
            - sum(x^2)/sigma^2 + (theta-1)*sum(log(1-exp(-1*(x/sigma)^2))))
  }
  return(ans)
}
burrx.mle2 <- function(x, par0 = c(1,1))
{
  temp.mle <- optim(par0, burrx.loglike, x = x, method = "Nelder-Mead", control = list(fnscale = -1))
  return(temp.mle)
}
qqburrx <- function(x, theta, sigma, use.mle = TRUE)
{
  # Check to see if we calculate the MLE.
  if (use.mle == TRUE)
  {
    par0 <- c(theta, sigma)
    temp.mle <- burrx.mle2(x, par0)
    theta <- temp.mle$par[1]
    sigma <- temp.mle$par[2]
  }
  # Sample quantiles
  x.sort <- sort(x)
  # Theoretical quantiles
  n <- length(x)
  i <- 1:n
  x.quantiles <- qburrx(q = i/(n+1), theta = theta, sigma = sigma)
  # Plot the data.
  plot.min <- min(x.sort, x.quantiles)
  plot.max <- max(x.sort, x.quantiles)
  plot(x.quantiles, x.sort,
       main = "Burr type X Q-Q Plot\nNote: For the Burr X to be appropriate, data must fall near the 45-degree line.",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       xlim = c(plot.min, plot.max), ylim = c(plot.min, plot.max))
  # Add 45-degree line
  line.coord <- c(plot.min, plot.max)
  lines(line.coord, line.coord)
}
Basically, I'm a beginner in R. Maybe I'm making an error when entering the parameters.
Just use
qqburrx(1:2000, 2, 1, use.mle = TRUE)
and change x to a:b according to your parameters.
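Note also that qburrx is not a base R function, so nothing will plot unless it is defined somewhere. A minimal sketch, obtained by inverting the Burr type X CDF F(x) = (1 - exp(-(x/sigma)^2))^theta that the log-likelihood above implies:
qburrx <- function(q, theta, sigma) {
  # solve (1 - exp(-(x/sigma)^2))^theta = q for x
  sigma * sqrt(-log(1 - q^(1/theta)))
}
Define this before calling qqburrx.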

Mean variance optimisation

I am doing a mean-variance optimization to solve a portfolio optimization problem. I am trying to minimize the variance subject to both constraints:
x1*m1 + x2*m2 + ... + xn*mn = m
x1 + x2 + ... + xn = 1
So this is the code I did:
################ Simulation for n=3 ################
################ Parameters ################
mu <- 50    ## Target mean of the portfolio
n <- 3      ## Number of assets
m1 <- 30000 ## Size of the simulation
########### 3 Assets ############
x <- rnorm(m1, 2, 1)
y <- rnorm(m1, 0.5, 1.5)
z <- rnorm(m1, 3.75, 1)
d <- data.frame(x, y, z)
################ Direct Solution ################
Sol <- function(m1) {
  A = matrix(nrow = n+2, ncol = n+2)
  for (i in 1:n) {
    for (j in 1:n)
      if (i == j) {
        A[i,j] <- (2*var(d[,i]))
      } else {
        A[i,j] <- cov(d[,i], d[,j])
      }
  }
  for (i in 1:n) {
    A[i,n+1] <- -mean(d[,i])
    A[i,n+2] <- -1
  }
  for (j in 1:n) {
    A[n+1,j] <- mean(d[,j])
    A[n+2,j] <- 1
  }
  for (i in 2:n+2) {   # note: 2:n+2 parses as (2:n)+2 = 4:5 in R
    for (j in 2:n+2)
      if (i == j) {
        A[i,j] <- 0
      } else {
        A[i,j] <- 0
      }
  }
  A
  Inv = solve(A)
  Sol = Inv %*% c(0, 0, 0, m1, 1)
  result = list(x = Sol, A = A, Inv = Inv)
  return(result)
}
Sol(mu)
Sol(mu)$x ## The solution
Sol(mu)$A
I know I'm using a lot of bad practices for R, but I could not figure out a better solution. So my question: is it correct?
Any corrections and suggestions to improve this process are welcome; please feel free to share your existing R code.
Huge thanks!
One way is to minimize numerically with solnp() from the Rsolnp package. This also offers a way to add more restrictions (leverage constraints, etc.):
muVec <- colMeans(d) #mean-vector of assets
Sigma <- cov(d) #covariance-matrix
fmin <- function(x) as.numeric(t(x) %*% Sigma %*% x) #variance of portfolio to min.
eqn <- function(x) c(t(x) %*% muVec, sum(x)) #equality restriction
sol <- function(mu) Rsolnp::solnp(rep(0.5, 3), fun=fmin, eqfun=eqn, eqB=c(mu,1))
x <- sol(50)
After solving, we can print the parameters and the portfolio variance:
> x$par
[1] -5.490106 -11.270906 17.761012
> x$vscale[1]
[1] 630.4916
In your simple case a closed-form solution exists, which boils down to:
S <- solve(Sigma)
A <- matrix(c(t(muVec) %*% S %*% muVec,
              rep(t(muVec) %*% S %*% rep(1,3), 2),
              t(rep(1,3)) %*% S %*% rep(1,3)), ncol = 2)
sol2 <- function(mu) S %*% cbind(muVec, 1) %*% solve(A) %*% c(mu, 1)
which "luckily" gives the same results:
> sol2(50)
[,1]
x -5.490106
y -11.270906
z 17.761012
> fmin(sol2(50))
[1] 630.4916
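A quick sanity check (using d, muVec and Sigma from above) that the closed-form weights satisfy both constraints:
w <- sol2(50)
sum(w)                # should be 1
drop(t(w) %*% muVec)  # should be 50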