Use the markovchain package to compare two empirically estimated Markov chains

I need to compare two transition probability matrices to assess how close the chains are, and I plan to use the p-value of the test for that.
I tried the markovchain R package, specifically the divergenceTest function, but the function is not properly implemented. It is based on the test on page 139 of the book "Statistical Inference Based on Divergence Measures". I contacted the package developers, but they have not corrected it yet, so I tried to implement the test myself. I'm having trouble with my version; could anyone help me find the error?
Parameters: freq_matrix is the frequency matrix used to estimate the probability matrix; hypothetic is the matrix that the estimated matrix is compared against.
divergenceTest3 <- function(freq_matrix, hypothetic){
n <- sum(freq_matrix)
empirical = freq_matrix
for (i in 1:length(hypothetic)){
empirical[i,] <- freq_matrix[i,]/rowSums(freq_matrix)[i]
}
M <- nrow(empirical)
v <- numeric()
out <- 2 * n / .phi2(1)
sum <- 0
c <- 0
for(i in 1:M){
sum2 <- 0
sum3 <- 0
for(j in 1:M){
if(hypothetic[i, j] > 0){
c <- c + 1
}
sum2 <- sum2 + hypothetic[i, j] * .phi(empirical[i, j] / hypothetic[i, j])
}
v[i] <- rowSums(freq_matrix)[i]
sum <- sum + ((v[i] / n) * sum2)
}
TStat <- out * sum
pvalue <- 1 - pchisq(TStat, c-M)
cat("The Divergence test statistic is: ", TStat, " the Chi-Square d.f. are: ", c-M," the p-value is: ", pvalue,"\n")
out <- list(statistic = TStat, p.value = pvalue)
return(out)
}
# phi function for divergence test
.phi <- function(x) {
out <- x*log(x) - x + 1
return(out)
}
# another phi function for divergence test
.phi2 <- function(x) {
out <- 1/x
return(out)
}

The divergence test has been replaced by the verifyHomogeneity function. It requires an input list of elements that can be coerced to a raw transition matrix (as with createSequenceMatrix). It then tests whether they belong to the same unknown DTMC.
See the example below:
library(markovchain)
myMatr1<-matrix(c(0.2,.8,.5,.5),byrow=TRUE, nrow=2)
myMatr2<-matrix(c(0.5,.5,.4,.6),byrow=TRUE, nrow=2)
mc1<-as(myMatr1,"markovchain")
mc2<-as(myMatr2,"markovchain")
mc1
mc2
sample1<-rmarkovchain(n=100, object=mc1)
sample2<-rmarkovchain(n=200, object=mc2)
# should reject
verifyHomogeneity(inputList = list(sample1,sample2))
#should accept
sample2<-rmarkovchain(n=200, object=mc1)
verifyHomogeneity(inputList = list(sample1,sample2))
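To make the idea of a "raw transition matrix" concrete, here is a minimal base R sketch of how one-step transition counts can be tabulated from an observed sequence. The helper name count_transitions is hypothetical and is not part of the markovchain package; it only illustrates the kind of count matrix that such a test works from.

```r
# Hypothetical helper: tabulate one-step transitions in a state sequence.
count_transitions <- function(x, states = sort(unique(x))) {
  # Using factors with fixed levels keeps cells for transitions never observed
  from <- factor(head(x, -1), levels = states)
  to   <- factor(tail(x, -1), levels = states)
  table(from, to)
}

x <- c("a", "b", "a", "a", "b")
counts <- count_transitions(x)            # raw transition counts
probs  <- prop.table(counts, margin = 1)  # row-normalised estimate
counts
probs
```

Each row of probs is the empirical distribution of the next state given the current one, which is what the estimated chains are compared on.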

Related

Converting the outputs of a for loop from a list to a data frame

I have constructed a discrete-time SIR model using a loop within a function (I have added my code below).
Currently the results of the iterations are coming out as a list which seems to show all the S values first followed by the I values and then the R values, which I have deduced myself from the nature of the values.
I need the output as a data frame with the column names: 'Iteration', 'S', 'I' and 'R' from left to right and the corresponding values underneath such that when a row is read it will tell you the iteration and values of S, I and R at that iteration.
I do not know how to construct a data frame that returns the output values in this way. I only started learning R a few weeks ago and am not yet proficient, so any help would be hugely appreciated.
Thank you in advance.
#INITIAL CONDITIONS
S=999
I=1
R=0
#PARAMETERS
beta = 0.003 # infectious contact rate (/person/day)
gamma = 0.2 # recovery rate (/day)
#SIR MODEL WITH POISSON SAMPLING
discrete_SIR_model <- function(){
for(i in 1:30){ #the number of iterations of loop indicates the
#duration of the model in days
# i.e. 'i in 1:30' constitutes 30 days
deltaI<- rpois(1,beta * I * S) #rate at which individuals in the
#population are becoming infected
deltaR<-rpois(1,gamma * I)#rate at which infected individuals are
#recovering
S[i+1]<-S[i] -deltaI
I[i+1] <-I[i] + deltaI -deltaR
R[i+1]<-R[i]+deltaR
}
}
output <- list(c(S, I, R))
output

If a for loop is used, one can define vectors or a data frame beforehand in which the results are stored:
beta <- 0.001 # infectious contact rate (/person/day)
gamma <- 0.2 # recovery rate (/day)
S <- I <- R <- numeric(31)
S[1] <- 999
I[1] <- 1
R[1] <- 0
set.seed(123) # makes the example reproducible
for(i in 1:30){
deltaI <- rpois(1, beta * I[i] * S[i])
deltaR <- rpois(1, gamma * I[i])
S[i+1] <- S[i] - deltaI
I[i+1] <- I[i] + deltaI - deltaR
R[i+1] <- R[i] + deltaR
}
output <- data.frame(S, I, R)
output
matplot(output)
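If the Iteration column asked for in the question is wanted, one possible sketch wraps the same loop in a function and adds that column when building the data frame. The name simulate_sir and its arguments are hypothetical, and the min() clamps are an added assumption (not in the loop above) that keeps S and I from going negative.

```r
# Hypothetical wrapper around the loop above; returns one row per day.
simulate_sir <- function(days = 30, S0 = 999, I0 = 1, R0 = 0,
                         beta = 0.001, gamma = 0.2) {
  S <- I <- R <- numeric(days + 1)
  S[1] <- S0; I[1] <- I0; R[1] <- R0
  for (i in seq_len(days)) {
    # Assumption: cap the Poisson draws so no compartment drops below zero
    deltaI <- min(rpois(1, beta * I[i] * S[i]), S[i])
    deltaR <- min(rpois(1, gamma * I[i]), I[i])
    S[i + 1] <- S[i] - deltaI
    I[i + 1] <- I[i] + deltaI - deltaR
    R[i + 1] <- R[i] + deltaR
  }
  # Iteration 0 holds the initial conditions
  data.frame(Iteration = 0:days, S, I, R)
}

set.seed(123)
output <- simulate_sir()
head(output)
```

Reading across a row then gives the iteration together with the S, I and R values at that iteration.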
As an alternative, it is also possible to employ a package for this. Package deSolve is intended for differential equations, but it can also solve the discrete case with method "euler":
library(deSolve)
discrete_SIR_model <- function(t, y, p) {
with(as.list(c(y, p)), {
deltaI <- rpois(1, beta * I * S)
deltaR <- rpois(1, gamma * I)
list(as.double(c(-deltaI, deltaI - deltaR, deltaR)))
})
}
y0 <- c(S = 999.0, I=1, R=0)
p <- c(
beta = 0.001, # infectious contact rate (/person/day)
gamma = 0.2 # recovery rate (/day)
)
times <- 1:30
set.seed(576) # to make the example reproducible
output <- ode(y0, times, discrete_SIR_model, p, method="euler")
plot(output, mfrow=c(1,3))
Note: I reduced beta, otherwise the discrete model would become unstable.

R function writing - getting error: "NaNs produced" / "Error in tsort[U + 1]: only 0's may be mixed with negative subscripts"

I am creating an R function that calculates a bootstrapped bias-corrected and accelerated (BCa) interval without using any add-on packages. Most of my code seems to work, but I am struggling to write the code for the lower and upper limits of the interval. Any suggestions would be helpful.
BCa <- function(stat,X,k,level=0.95,...){
if(!is.numeric(k)||k<=0){
stop("The number of bootstrap resamples 'k' must be a numeric value greater than 0")
}
t.star <- stat(X,...)
t.k <- rep(NA,k)
for(i in 1:k){
Xi <- sample(X,replace=TRUE)
t.k[i] <- stat(Xi,...)
}
z0 <- qnorm(mean(t.k<t.star))
n <- length(X)
t.minus.j <- rep(NA,n)
for(j in 1:n){
Xj <- X[-j]
t.minus.j[j]<- stat(Xj,...)
}
t.bar.minus <- mean(t.minus.j)
t.diff <- t.bar.minus - t.minus.j
a <- ((sum(t.diff^3))/(6*(t.diff^2)^3/2))
alpha <- 1-level
tsort <- sort(t.k, decreasing = FALSE)
L <- pnorm(z0 + ((z0+qnorm(alpha/2))/((1-a)*z0+qnorm(alpha/2))))
U <- qnorm(z0 + ((z0+qnorm(alpha/2))/((1-a)*z0+qnorm(alpha/2))))
if(!is.integer(L)){
L <- floor(L*(k+1))
}
if(!is.integer(U)){
U <- ceiling(U*(k+1))
}
lower.limit <- tsort[L]
upper.limit <- tsort[U+1]
return(list(t.star=t.star,ci=c(lower.limit,upper.limit)))
}
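For comparison, here is a minimal sketch of the standard BCa construction as it is usually stated: both adjusted levels are computed with pnorm (so they stay inside 0 and 1), the lower and upper limits use different normal quantiles, and the acceleration divides by 6 * (sum of squares)^(3/2). The function name bca_interval is hypothetical, and quantile() is used in place of manual indexing into the sorted resamples; this is an illustration under those assumptions, not a drop-in fix of the code above.

```r
# Hypothetical sketch of a BCa interval for a statistic of a numeric vector.
bca_interval <- function(stat, x, k = 2000, level = 0.95) {
  n <- length(x)
  t_hat  <- stat(x)
  t_boot <- replicate(k, stat(sample(x, replace = TRUE)))

  # Bias correction: normal quantile of the share of resamples below t_hat
  z0 <- qnorm(mean(t_boot < t_hat))

  # Acceleration from jackknife (leave-one-out) values
  t_jack <- vapply(seq_len(n), function(j) stat(x[-j]), numeric(1))
  d <- mean(t_jack) - t_jack
  a <- sum(d^3) / (6 * sum(d^2)^(3 / 2))

  # Adjusted percentile levels: one for the lower, one for the upper limit
  z <- qnorm(c((1 - level) / 2, 1 - (1 - level) / 2))
  probs <- pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))

  unname(quantile(t_boot, probs))
}

set.seed(1)
x <- rexp(50)
ci <- bca_interval(mean, x)
ci
```

Because the adjusted levels are probabilities, they can never produce the negative or NaN subscripts mentioned in the title.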

How to calculate standardized Pearson residuals by hand in R?

I am trying to calculate the standardized Pearson Residuals by hand in R. However, I am struggling when it comes to calculating the hat matrix.
I have built my own logistic regression and I am trying to calculate the standardized Pearson residuals in the logReg function.
logRegEst <- function(x, y, threshold = 1e-10, maxIter = 100)
{
calcPi <- function(x, beta)
{
beta <- as.vector(beta)
return(exp(x %*% beta) / (1 + exp(x %*% beta)))
}
beta <- rep(0, ncol(x)) # initial guess for beta
diff <- 1000
# initial value bigger than threshold so that we can enter our while loop
iterCount = 0
# counter for the iterations to ensure we're not stuck in an infinite loop
while(diff > threshold) # tests for convergence
{
pi <- as.vector(calcPi(x, beta))
# calculate pi by using the current estimate of beta
W <- diag(pi * (1 - pi))
# calculate the matrix of weights W as defined in the Fisher scoring algorithm
beta_change <- solve(t(x) %*% W %*% x) %*% t(x) %*% (y - pi)
# calculate the change in beta
beta <- beta + beta_change # new beta
diff <- sum(beta_change^2)
# calculate how much we changed beta by in this iteration
# if this is less than threshold, we'll break the while loop
iterCount <- iterCount + 1
# see if we've hit the maximum number of iterations
if(iterCount > maxIter){
stop("This isn't converging.")
}
# stop if we have hit the maximum number of iterations
}
n <- length(y)
df <- length(y) - ncol(x)
# calculating the degrees of freedom by taking the length of y minus
# the number of x columns
vcov <- solve(t(x) %*% W %*% x)
logLik <- sum(y * log(pi / (1 - pi)) + log(1 - pi))
deviance <- -2 * logLik
AIC <- -2 * logLik + 2 * ncol(x)
rank <- ncol(x)
list(coefficients = beta, vcov = vcov, df = df, deviance = deviance,
AIC = AIC, iter = iterCount - 1, x = x, y = y, n = n, rank = rank)
# returning results
}
logReg <- function(formula, data)
{
if (sum(is.na(data)) > 0) {
print("missing values in data")
} else {
mf <- model.frame(formula = formula, data = data)
# model.frame() returns us a data.frame with the variables needed to use the
# formula.
x <- model.matrix(attr(mf, "terms"), data = mf)
# model.matrix() creates a design matrix. That means that for example the
#"Sex"-variable is given as a dummy variable with ones and zeros.
y <- as.numeric(model.response(mf)) - 1
# model.response gives us the response variable.
est <- logRegEst(x, y)
# Now we have the starting position to apply our function from above.
est$formula <- formula
est$call <- match.call()
# We add the formula and the call to the list.
nullModel <- logRegEst(x = as.matrix(rep(1, length(y))), y)
est$nullDeviance <- nullModel$deviance
est$nullDf <- nullModel$df
mu <- exp(as.vector(est$x %*% est$coefficients)) /
(1 + exp(as.vector(est$x %*% est$coefficients)))
# computing the fitted values
est$residuals <- (est$y - mu) / sqrt(mu * (1 - mu))
est$mu <- mu
est$x <- x
est$y <- y
est$data <- data
hat <- (t(mu))^(1/2)%*%x%*%(t(x)%*%mu%*%x)^(-1)%*%t(x)%*%mu^(1/2)
est$stdresiduals <- est$residuals/(sqrt(1-hat))
class(est) <- "logReg"
# defining the class
est
}
}
I am struggling when it comes to calculating $H = \hat{V}^{1/2} X (X^T \hat{V} X)^{-1} X^T \hat{V}^{1/2}$. This is called hat in my code.
If I try to calculate the hat matrix (hat), I get an error saying that I cannot multiply the vector mu and the matrix x in this expression: t(x)%*%mu%*%x.
I can see that the dimensions of the matrices do not match, and therefore I can't multiply them.
Can anyone see where my mistake is? Help is much appreciated. Thanks!
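One way to read the formula: in logistic regression the estimated variance matrix is diagonal, with entries mu_i * (1 - mu_i), so the fitted vector has to be turned into weights before any matrix product. The sketch below illustrates this on simulated data, using glm only to obtain fitted values; the variable names (w, XW, h, std_pearson) are hypothetical and the data are made up for the example.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
X  <- cbind(Intercept = 1, x1 = x1)       # design matrix
y  <- rbinom(n, 1, plogis(-0.5 + x1))     # simulated binary response

fit <- glm(y ~ x1, family = binomial)
mu  <- fitted(fit)

w  <- mu * (1 - mu)       # diagonal of V-hat
XW <- X * sqrt(w)         # scales row i by sqrt(w_i), i.e. V^{1/2} X
H  <- XW %*% solve(crossprod(XW)) %*% t(XW)   # crossprod(XW) = X' V X
h  <- diag(H)             # leverages

pearson     <- (y - mu) / sqrt(w)
std_pearson <- pearson / sqrt(1 - h)

# h should agree closely with hatvalues(fit), up to convergence tolerance
head(cbind(h, hatvalues(fit)))
```

Only the diagonal of H is needed for the standardised residuals, so est$residuals would be divided by sqrt(1 - h) elementwise rather than by a full matrix.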

Newton Raphson for logistic regression

I wrote code for Newton-Raphson for logistic regression. Unfortunately, I have tried many data sets and it never converges. There is a mistake, but I do not know where it is. Can anyone help me figure out what the problem is?
First, the data are as follows: y is the response (0 or 1), and Z is a 115 x 30 matrix of explanatory variables. I need to estimate the 30 parameters.
y = c(rep(0,60),rep(1,55))
X = sample(c(0,1),size=3450,replace=T)
Z = t(matrix(X,ncol=115))
#The code is ;
B = matrix(rep(0,30*10),ncol=10)
B[,1] = matrix(rep(0,30),ncol=1)
for(i in 2 : 10){
print(i)
p <- exp(Z %*%as.matrix(B[,i])) / (1 + exp(Z %*% as.matrix(B[,i])))
v.2 <- diag(as.vector(1 * p*(1-p)))
score.2 <- t(Z) %*% (y - p) # score function
increm <- solve(t(Z) %*% v.2 %*% Z)
B[,i] = as.matrix(B[,i-1])+increm%*%score.2
if(B[,i]-B[i-1]==matrix(rep(0.0001,30),ncol=1)){
return(B)
}
}

Found it! You're updating p based on B[,i], you should be using B[,i-1] ...
While I was finding the answer, I cleaned up your code and incorporated the results in a function. R's built-in glm seems to work (see below). One note is that this approach is likely to be unstable: fitting a binary model with 30 predictors and only 115 binary responses, and without any penalization or shrinkage, is extremely optimistic ...
set.seed(101)
n.obs <- 115
n.zero <- 60
n.pred <- 30
y <- c(rep(0,n.zero),rep(1,n.obs-n.zero))
X <- sample(c(0,1),size=n.pred*n.obs,replace=TRUE)
Z <- t(matrix(X,ncol=n.obs))
R's built-in glm fitter does work (it uses iteratively reweighted least squares, not N-R):
g1 <- glm(y~.-1,data.frame(y,Z),family="binomial")
(If you want to view the results, library("arm"); coefplot(g1).)
## B_{m+1} = B_m + (X^T V_m X)^{-1} X^T (Y-P_m)
NRfit function:
NRfit <- function(y,X,start,n.iter=100,tol=1e-4,verbose=TRUE) {
## used X rather than Z just because it's more standard notation
n.pred <- ncol(X)
B <- matrix(NA,ncol=n.iter,
nrow=n.pred)
B[,1] <- start
for (i in 2:n.iter) {
if (verbose) cat(i,"\n")
p <- plogis(X %*% B[,i-1])
v.2 <- diag(c(p*(1-p)))
score.2 <- t(X) %*% (y - p) # score function
increm <- solve(t(X) %*% v.2 %*% X)
B[,i] <- B[,i-1]+increm%*%score.2
if (all(abs(B[,i]-B[,i-1]) < tol)) return(B)
}
B
}
matplot(res1 <- t(NRfit(y,Z,start=coef(g1))))
matplot(res2 <- t(NRfit(y,Z,start=rep(0,ncol(Z)))))
all.equal(res2[6,],unname(coef(g1))) ## TRUE

MLE in R (multiple parameter function)

I'm trying to do MLE in R to simulate mark-recapture methods. I'm trying to maximise this function:
likelihood <- function(N, p){
likelihood <- 1
X <- 0
for (s in 1:S){
X <- X + M[s]
}
likelihood <- likelihood * p^(X) * (1-p)^(N*S-X)
likelihood <- likelihood * (factorial(N)/factorial((N-length(U))))
return(likelihood)
}
I'm trying to maximise over N; all the other parameters are known or estimated, and so are assumed known.
I have looked into nlm and optim but can't get either to do what I want...
In response to answers:
Thanks for all the answers. I recoded the likelihood and managed to get something working; see the code below. To respond to specific points:
1. Thanks for X <- sum(M[1:S]), nice trick. X counts all the animals ever caught (not unique captures).
2. M is generated in the code below from n (the captures); it is a count of the number of animals marked on each survey. S is the number of surveys.
captures <- function(N, S, P){
P <- replicate(S, P) #Capture Probability (same across animals and surveys)
captures <- t(replicate(N, rbinom(S, 1, P))) #Generate capture data from N animals with S surveys with capture Probability P
return(captures)
}
marked <- function(N, S, captures){ #Count numbers that were marked on each survey
M <- replicate(S, 0) #Initialise the 'marked' variable to zero
for (s in 1:S){
for (i in 1:N){
if (captures[i, s] == 1){
M[s] <- M[s] + 1
}
}
}
return(M)
}
unseen <- function(N, S, captures){#Count total number of animals that were unseen on any survey
U <- array()
seen <- replicate(N, FALSE) #All animals begin as unseen
for (i in 1:N){
for (s in 1:S){
if (captures[i, s] == 1){
seen[i] = TRUE
}
}
if (seen[i] == FALSE){
U <- c(U, i)
}
}
U <- U[-1] #fix the N/A in the first index
return(U)
}
firstcaught <- function(N, S, captures){#Produce a vector containing how many animals were caught for the first time on each survey
fc <- 0
seen <- replicate(N, FALSE)
for (s in 1:S){
fc[s] <- 0
for (i in 1:N){
if ((seen[i] == FALSE) && (captures[i, s] == 1)){
fc[s] <- fc[s] + 1
seen[i] = TRUE
}
}
}
return(fc)
}
##Generate data using functions
N <- 200
S <- 10
P <- 0.2
n <- captures(N,S,P)
M <- marked(N,S,n)
U <- unseen(N,S,n)
fc <- firstcaught(N,S,n)
Mc <- cumsum(fc) #A running total of animals that have been captured at least once
##Define a likelihood for model M_0
likelihood <- function(N, P){
likelihood <- 1
X <- 0
for (s in 1:S){
X <- X + M[s]
likelihood <- likelihood * (1-P)^(N-M[s])
likelihood <- likelihood * choose(N-M[s],fc[s])
}
likelihood <- likelihood * P^(X) * (1-P)^(-X)
return(-log(likelihood))
}
##Find the MLE
out <- nlm(f = likelihood, p = 200, P = P, check.analyticals = TRUE)
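The counting helpers above (marked, unseen, firstcaught) can also be sketched with vectorised base R calls, in the same spirit as the sum(M[1:S]) suggestion. The matrix name caps is hypothetical and is used here to avoid clashing with the captures function; the simulated data are only for illustration.

```r
set.seed(1)
N <- 200; S <- 10; P <- 0.2

# N x S capture histories: 1 = caught on that survey, 0 = not caught
caps <- matrix(rbinom(N * S, 1, P), nrow = N, ncol = S)

M <- colSums(caps)                # animals caught on each survey
U <- which(rowSums(caps) == 0)    # indices of animals never caught

# Survey of first capture per animal (NA if never caught)
first <- apply(caps, 1, function(r) match(1, r))
fc <- tabulate(first[!is.na(first)], nbins = S)  # first captures per survey
Mc <- cumsum(fc)                  # running total of animals seen at least once
```

A quick consistency check is that sum(fc) + length(U) equals N, since every animal is either first caught on exactly one survey or never caught.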
