I have a dataset (not normally distributed) with repeated measures over time. As such, I was planning to use friedman.test in R:
friedman.test(dash ~ time | nr)
this gave:
Friedman rank sum test
data: dash and time and nr
Friedman chi-squared = 105.26, df = 2, p-value < 2.2e-16
To calculate the CI of the chi-squared I tried the following:
library(boot)
friedman_boot <- function(data, indices) {
  # use the resampled rows of the data passed in by boot(), not the global data_t
  friedman.test(dash ~ time | nr, data = data[indices, ])$statistic
}
boot_results <- boot(data = data_t, statistic = friedman_boot, R = 1000)
boot_ci <- boot.ci(boot.out = boot_results, type = "perc", conf = 0.95)
boot_ci
which gave:
boot_results <- boot(data = data_t, statistic = friedman_boot, R = 1000)
Error in friedman.test.default(mf[[1L]], mf[[2L]], mf[[3L]]) :
not an unreplicated complete block design
I do not quite understand why this happens. Does anyone have another way to calculate CI for the test statistic?
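A likely cause: boot() resamples individual rows of the long-format data, so in any given resample some nr/time combinations appear more than once and others not at all, and friedman.test() then rejects the data as not being an unreplicated complete block design. A minimal sketch of one workaround is to resample whole subjects instead. It assumes the data can be reshaped to a wide matrix with one row per subject and one column per time point. The simulated `wide` matrix and its column names are placeholders, not the original data:

```r
library(boot)

set.seed(1)
n_subjects <- 30
# Placeholder data: one row per subject (block), one column per time point
wide <- cbind(t1 = rnorm(n_subjects, 10),
              t2 = rnorm(n_subjects, 11),
              t3 = rnorm(n_subjects, 12))

# Resample whole rows (subjects). Every resampled row keeps all of its time
# points, so the matrix given to friedman.test() is always a complete design.
friedman_boot <- function(data, indices) {
  unname(friedman.test(data[indices, , drop = FALSE])$statistic)
}

boot_results <- boot(data = wide, statistic = friedman_boot, R = 500)
boot_ci <- boot.ci(boot.out = boot_results, type = "perc", conf = 0.95)
boot_ci
```

With real long-format data, the wide matrix could be built first with something like reshape() or tidyr::pivot_wider().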
I have edited my question
Goal
I want to keep only those objects that were successfully created and ignore those that threw errors.
Example
Please note that this is just a reproducible example. My original dataset is different.
The following function takes any variable of the mtcars dataset, fits three theoretical distributions, and then returns the goodness-of-fit statistics:
library(fitdistrplus)
fit_distt <- function(var) {
  v <- mtcars[, var]
  f1 <- fitdist(data = v, distr = "norm")
  f2 <- fitdist(data = v, distr = "nbinom")
  f3 <- fitdist(data = v, distr = "gamma")
  gofstat(f = list(f1, f2, f3),
          chisqbreaks = c(0, 3, 3.5, 4, 4.5,
                          5, 10, 20, 30, 40),
          fitnames = c("normal", "nbinom", "gamma"))
}
For instance:
> fit_distt("gear")
Goodness-of-fit statistics
                                normal    nbinom     gamma
Kolmogorov-Smirnov statistic 0.2968616 0.4967268 0.3030232
Cramer-von Mises statistic   0.4944390 1.5117544 0.5153004
Anderson-Darling statistic   3.1060083 7.2858460 3.1742713

Goodness-of-fit criteria
                                 normal   nbinom    gamma
Akaike's Information Criterion 74.33518 109.9331 72.07507
Bayesian Information Criterion 77.26665 112.8646 75.00655
Problem
Some theoretical distributions cannot be fitted to a given variable, and fitdist throws an error:
> fit_distt("mpg")
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Error in fitdist(data = v, distr = "nbinom") :
the function mle failed to estimate the parameters,
with the error code 100
This error occurred with f2, which tries to fit nbinom to the continuous variable mpg. But norm and gamma fit successfully.
I want to return the gofstat for the successfully fitted distributions and ignore the ones that threw an error.
Expected output
Even though f2 is specified in the function, if it throws an error, I still want the following output:
> fit_distt("mpg")
Goodness-of-fit statistics
                                 normal      gamma
Kolmogorov-Smirnov statistic 0.12485059 0.08841088
Cramer-von Mises statistic   0.08800019 0.03793323
Anderson-Darling statistic   0.58886727 0.28886166

Goodness-of-fit criteria
                                 normal    gamma
Akaike's Information Criterion 208.7555 205.8416
Bayesian Information Criterion 211.6870 208.7731
What I tried
Obviously, I can just remove f2 from the function. But that means repeating all the code for each variable. That's a lot of code! So, I still want to use the function.
And I want to be able to use the function for any variable. With mtcars$mpg, the function fails for nbinom, but with mtcars$vs, the function fails for gamma. In either case, I want to skip the fits that threw an error and report gofstat for the fits that worked.
I can use purrr::possibly to quietly return either the fit result or a fallback value instead of stopping at the error. But I don't know how to pass only the successfully fitted objects to gofstat.
You could try with try. Try to fit the distribution and only add it to the list you pass to gofstat if it works:
library(fitdistrplus)
#> Loading required package: MASS
#> Loading required package: survival
fit_distt <- function(var) {
  v <- mtcars[, var]
  distributions <- c("norm", "nbinom", "gamma")
  fs <- list()
  fitted_distributions <- vector(mode = "character")
  for (i in seq_along(distributions)) {
    # try to fit the model
    fit <- try(fitdist(data = v, distr = distributions[i]), silent = TRUE)
    # if it works, add it to fs. If not, ¯\_(ツ)_/¯
    if (!inherits(fit, "try-error")) {
      fs[[length(fs)+1]] <- fit
      fitted_distributions[length(fitted_distributions)+1] <- distributions[i]
    }
  }
  gofstat(f = fs,
          chisqbreaks = c(0, 3, 3.5, 4, 4.5,
                          5, 10, 20, 30, 40),
          fitnames = fitted_distributions)
}
fit_distt("mpg")
#> <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
#> Goodness-of-fit statistics
#>                                    norm      gamma
#> Kolmogorov-Smirnov statistic 0.12485059 0.08841088
#> Cramer-von Mises statistic   0.08800019 0.03793323
#> Anderson-Darling statistic   0.58886727 0.28886166
#>
#> Goodness-of-fit criteria
#>                                    norm    gamma
#> Akaike's Information Criterion 208.7555 205.8416
#> Bayesian Information Criterion 211.6870 208.7731
Created on 2020-10-07 by the reprex package (v0.3.0)
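The same "keep what worked, drop what failed" pattern can also be written more generally with tryCatch(). The sketch below is only an illustration: keep_successes() is a hypothetical helper name, and base R's sqrt() stands in for fitdist() so that the example runs without extra packages.

```r
# Hypothetical helper: apply f to each element of xs and keep only the
# results that did not raise an error (failures are turned into NULL first).
keep_successes <- function(xs, f) {
  results <- lapply(xs, function(x) tryCatch(f(x), error = function(e) NULL))
  Filter(Negate(is.null), results)
}

# sqrt() errors on a character input, much like fitdist() on unsuitable data
inputs <- list(a = 4, b = "oops", c = 9)
roots <- keep_successes(inputs, sqrt)
names(roots)
```

The names of the surviving elements could then be passed on as fitnames. One caveat of this design is that a function that legitimately returns NULL would be dropped as well.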
You can replace your individual list subsets with a single sapply call. If the function it applies returns NULL rather than NA, the entry will disappear when the result is unlisted. The following function will therefore do what you want, as shown in this reprex:
find_mean_of_each_vector_in_a_list <- function(my_list) {
  suppressWarnings(
    as.numeric(
      unlist(
        sapply(my_list, function(x) if (is.na(mean(x))) NULL else mean(x))
      )
    )
  )
}
my_list_1 <- list(a = 1:3, b = 5:6, c = 7:10)
my_list_2 <- list(a = 1:3, b = c("a", "b"), c = 7:10)
find_mean_of_each_vector_in_a_list(my_list_1)
#> [1] 2.0 5.5 8.5
find_mean_of_each_vector_in_a_list(my_list_2)
#> [1] 2.0 8.5
Created on 2020-10-07 by the reprex package (v0.3.0)
I'm trying to estimate an Okun's law equation with a dlm using the dlm package in R. I can estimate the non-time varying model using nls as follows:
const_coef <- nls(formula = dur~ b1*dur_lag1 + b2*(d2lgdp-b0) + b3*d2lrulc_lag2 ,
start = list(b0 =0.1, b1=0.1, b2=0.1, b3=0.1),
data = mod_data)
The DLM I want to estimate allows b1 and b0 in the above to follow random walks. I can do this in EViews by declaring the measurement equation and appending the states (below is some code provided by the authors of the original paper, which I can replicate):
'==========================
' SPECIFY THE KALMAN FILTER
'==========================
'Priors on state variables
vector(2) mprior
mprior(1) = 4 'Prior on starting value for trend GDP growth (annual average GDP growth over 1950s)
mprior(2) = 0 'Prior on starting value for lagged dependent variable
sym(2) vprior
vprior(1,1) = 5 'Prior on variance of trend GDP growth (variance of annual GDP growth over 1950s)
vprior(2,2) = 1 'Prior on variance of lagged dependent variable
'Specify coefficient vector
coef(8) ckf
'Declare state space
sspace ss1
ss1.append dur = lag*dur(-1) + ckf(2)*(d2lgdp-trend)+ckf(3)*D2LRULC(-2)+[var=exp(ckf(4))] 'Measurement equation
ss1.append #state trend = 1*trend(-1) + [var = exp(ckf(5))] 'State equation for trend GDP growth (random walk)
ss1.append #state lag = 1*lag(-1) + [var = exp(ckf(6))] 'State equation for lagged dependent variable (random walk)
'Apply priors to state space
ss1.append #mprior mprior
ss1.append #vprior vprior
'Set parameter starting values
param ckf(2) -0.0495 ckf(3) 0.01942 ckf(4) -2.8913 ckf(5) -4.1757 ckf(6) -6.2466 'starting values for parameters
'=====================
' ESTIMATE THE MODEL
'=====================
'Estimate state space
smpl %estsd %ested 'Estimation sample
ss1.ml(m=500,showopts) 'Estimate Kalman filter by maximum likelihood
freeze(mytab) ss1.stats
I'm really not sure how to do this with the dlm package. I've tried the following:
library(dlm)
buildSS <- function(v){
  dV <- exp(v[1])       # variance of the measurement equation (ckf4)
  dW <- c(exp(v[2]),    # variance of the lagged dep (ckf6)
          0,            # variance of the coef on d2lgdp ckf(2) set to 0
          0,            # variance of the coef on d2lrulc ckf(3) set to 0
          exp(v[3])     # variance of the random walk intercept (ckf5)
  )
  beta.vec <- c(1, v[4], v[5], 1)  # params ckf(2), ckf(3)
  okuns <- dlmModReg(mod_data.tvp[,-1], addInt = TRUE, dV = dV, dW = dW, m0 = beta.vec)
}
#'Set parameter starting values
ckf4Guess <- -2.8913
ckf2guess <- -0.0495
ckf3guess <- 0.01942
ckf5guess <- -4.1757
ckf6guess <- -6.2466
params <- c(ckf4Guess,
ckf5guess,
ckf6guess,
ckf2guess,
ckf3guess)
tvp_mod.mle <- dlmMLE(mod_data.tvp[,"dur"] , parm = params, build = buildSS)
tvp_mod <- buildSS(tvp_mod.mle$par)
tvp_filter <- dlmFilter(mod_data$dur,tvp_mod)
The above code runs, but the outputs are not correct. I am not specifying the states properly. Does anyone have any experience in building DLMs with multivariate regression in R?
I think I have found a solution: I've managed to recreate the estimates in the paper, which estimates this model using EViews (I also checked this in EViews).
#--------------------------------------------------------------------------------------------------------------------------
# tvp model full model - dur = alpha*dur(-1)+ beta(dgdp-potential) + gamma*wages
#--------------------------------------------------------------------------------------------------------------------------
# Construct DLM
OkunsDLMfm <- dlm(
  FF = matrix(c(1,1,1,1), ncol = 4, byrow = TRUE),
  V = matrix(1),
  GG = matrix(c(1,0,0,0,
                0,1,0,0,
                0,0,1,0,
                0,0,0,1), ncol = 4, byrow = TRUE),
  W = matrix(c(1,0,0,0,
               0,1,0,0,
               0,0,1,0,
               0,0,0,1), ncol = 4, byrow = TRUE),
  JFF = matrix(c(1,2,3,0), ncol = 4, byrow = TRUE),
  X = cbind(mod_data$dur_lag1, mod_data$d2lgdp, mod_data$d2lrulc_lag2), # lagged dep var, dgdp, wages.
  m0 = c(0,0,0,0),
  C0 = matrix(c(1e+07,0,0,0,
                0,1e+07,0,0,
                0,0,1e+07,0,
                0,0,0,1e+07), ncol = 4, byrow = TRUE)
)
buildOkunsFM <- function(p){
  V(OkunsDLMfm) <- exp(p[2])
  GG(OkunsDLMfm)[1,1] <- 1
  GG(OkunsDLMfm)[2,2] <- 1
  GG(OkunsDLMfm)[3,3] <- 1
  GG(OkunsDLMfm)[4,4] <- 1
  W(OkunsDLMfm)[1,1] <- exp(p[3])
  W(OkunsDLMfm)[2,2] <- 0
  W(OkunsDLMfm)[3,3] <- 0
  W(OkunsDLMfm)[4,4] <- exp(p[4])
  m0(OkunsDLMfm) <- c(0,0,0,p[1]*4)
  C0(OkunsDLMfm)[1,1] <- 1
  C0(OkunsDLMfm)[4,4] <- 5
  return(OkunsDLMfm)
}
okuns.estfm <- dlmMLE(y = mod_data$dur, parm = c(-0.049,-1.4,-6,-5), build = buildOkunsFM)
OkunsDLM1fm <- buildOkunsFM(okuns.estfm$par)
The time-varying level, i.e. the estimate of potential output, is obtained by dividing the fourth element of the state vector by the second and multiplying by -1.
I'm not sure this is the best way to specify the DLM, but the results are very close (within 0.01) to those obtained with EViews. That said, I'm very open to other specifications.
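To spell out where that ratio comes from: expanding beta*(dgdp - potential) in the model gives beta*dgdp - beta*potential, so the fourth (intercept) state plays the role of -beta*potential, and potential = -intercept / beta. A small self-contained sketch with made-up numbers follows. The `states` matrix and its column names are illustrative stand-ins for the filtered or smoothed states of the fitted model, not output from it:

```r
set.seed(42)
n <- 8
beta <- rep(-0.05, n)                          # constant coefficient on dgdp (state 2)
trend_true <- 4 + cumsum(rnorm(n, sd = 0.1))   # made-up random-walk potential growth

# Columns follow the state ordering used above:
# lagged dep var, dgdp coefficient, wage coefficient, time-varying intercept
states <- cbind(alpha = 0.9,
                beta = beta,
                gamma = 0.02,
                intercept = -beta * trend_true)

# Fourth state divided by the second, times -1
trend <- -states[, "intercept"] / states[, "beta"]
```

With a fitted model, the same column arithmetic would be applied to the state estimates returned by the filter or smoother, bearing in mind that those usually include an extra first row for the prior.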
I wrote this code to run a test statistic on two randomly distributed observations x and y
mean.test <- function(x, y, B=10000,
                      alternative=c("two.sided","less","greater"))
{
  p.value <- 0
  alternative <- match.arg(alternative)
  s <- replicate(B, (mean(sample(c(x,y), B, replace=TRUE))-mean(sample(c(x,y), B, replace=TRUE))))
  t <- mean(x) - mean(y)
  p.value <- 2*(1- pnorm(abs(quantile(T,0.01)), mean = 0, sd = 1, lower.tail =
                         TRUE, log.p = FALSE)) #try to calculate p value
  data.name <- deparse(substitute(c(x,y)))
  names(t) <- "difference in means"
  zero <- 0
  names(zero) <- "difference in means"
  return(structure(list(statistic = t, p.value = p.value,
                        method = "mean test", data.name = data.name,
                        observed = c(x,y), alternative = alternative,
                        null.value = zero),
                   class = "htest"))
}
The code uses a Monte Carlo simulation to generate the distribution of the test statistic mean(x) - mean(y) and then calculates the p-value, but apparently I have misdefined this p-value, because for:
> set.seed(0)
> mean.test(rnorm(1000,3,2),rnorm(2000,4,3))
the output should look like:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
but i got this instead:
mean test
data: c(rnorm(1000, 3, 2), rnorm(2000, 4, 3))
difference in means = -1.0967, p-value = 0.8087
alternative hypothesis: true difference in means is not equal to 0
Can someone explain the bug to me?
As far as I can tell, your code has several mistakes:
quantile(T, 0.01): here T == TRUE, so you're calculating the quantile of 1.
The object s is never used.
mean(sample(c(x,y), B, replace=TRUE)): what are you trying to do here? The c() function combines x and y, and sampling from the combined vector makes no sense since you don't know what population the values come from.
When you calculate the test statistic t, it should depend on the variance (and sample size).
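To illustrate the last point, here is a minimal sketch of a statistic that does depend on the variances and sample sizes. It uses a large-sample normal approximation (a Welch-type z statistic), which is my own choice for illustration and not necessarily what the original exercise intended:

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)

diff_means <- mean(x) - mean(y)
# Standard error of the difference: uses both variances and both sample sizes
std_error <- sqrt(var(x) / length(x) + var(y) / length(y))
z <- diff_means / std_error

# Two-sided p-value under the normal approximation
p_value <- 2 * pnorm(-abs(z))
```

The z value here is the same quantity that t.test(x, y) reports as its statistic. Only the reference distribution differs (normal rather than t).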
I am using the KFAS package in R to estimate a state-space model with the Kalman filter. My measurement and transition equations are:
y_t = Z_t * x_t + \eps_t (measurement)
x_t = T_t * x_{t-1} + R_t * \eta_t (transition),
with \eps_t ~ N(0,H_t) and \eta_t ~ N(0,Q_t).
So, I want to estimate the variances H_t and Q_t, but also T_t, the AR(1) coefficient. My code is as follows:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
fit <- fitSSM(ss_model, inits = c(0,0.6,0), method = 'L-BFGS-B')
But it returns: "Error in is.SSModel(do.call(updatefn, args = c(list(inits, model), update_args)),: System matrices (excluding Z) contain NA or infinite values, covariance matrices contain values larger than 1e+07"
The NA definitions for the variances work well, as documented in the package's paper. However, it seems this cannot be done for the AR coefficients. Does anyone know how I can do this?
Note that I am aware of the SSMarima function, which eases the definition of the transition equation for ARIMA models. Although I am able to estimate the AR(1) coefficient and Q_t this way, I still cannot estimate the \eps_t variance (H_t). Moreover, I am migrating my Kalman filter code from EViews to R, so I need to learn SSMcustom for other, more complicated models.
Thanks!
It seems that you are missing something in your example, as your error message comes from the function fitSSM. If you want to use fitSSM for estimating general state space models, you need to provide your own model updating function. The default behaviour can only handle NAs in the covariance matrices H and Q. The main goal of fitSSM is just to get started with simple stuff. For complex models and/or large data, I would recommend writing your own objective function (with the help of the logLik method) and calling your favourite numerical optimization routine manually for maximum performance. Something like this:
library(KFAS)
set.seed(100)
eps <- rt(200, 4, 1)
meas <- as.matrix((arima.sim(n=200, list(ar=0.6), innov = rnorm(200)*sqrt(0.5)) + eps),
ncol=1)
Zt <- 1
Ht <- matrix(NA)
Tt <- matrix(NA)
Rt <- 1
Qt <- matrix(NA)
ss_model <- SSModel(meas ~ -1 + SSMcustom(Z = Zt, T = Tt, R = Rt,
Q = Qt), H = Ht)
objf <- function(pars, model, estimate = TRUE) {
  model$H[1] <- pars[1]
  model$T[1] <- pars[2]
  model$Q[1] <- pars[3]
  if (estimate) {
    -logLik(model)
  } else {
    model
  }
}
opt <- optim(c(1, 0.5, 1), objf, method = "L-BFGS-B",
             lower = c(0, -0.99, 0), upper = c(100, 0.99, 100), model = ss_model)
ss_model_opt <- objf(opt$par, ss_model, estimate = FALSE)
Same with fitSSM:
updatefn <- function(pars, model) {
  model$H[1] <- pars[1]
  model$T[1] <- pars[2]
  model$Q[1] <- pars[3]
  model
}
fit <- fitSSM(ss_model, c(1, 0.5, 1), updatefn, method = "L-BFGS-B",
              lower = c(0, -0.99, 0), upper = c(100, 0.99, 100))
identical(ss_model_opt, fit$model)
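A variation on the same idea, offered as a sketch rather than as the package's recommended approach: optimise over unconstrained parameters and map them back inside the update function, using exp() for the variances and tanh() for the AR(1) coefficient, so that no box constraints are needed. The names to_natural() and updatefn_unconstrained() are made up for illustration:

```r
# Map unconstrained optimiser parameters to the model's natural scale
to_natural <- function(theta) {
  c(H   = exp(theta[[1]]),    # variance of eps_t, always positive
    phi = tanh(theta[[2]]),   # AR(1) coefficient, always inside (-1, 1)
    Q   = exp(theta[[3]]))    # variance of eta_t, always positive
}

# Update function with the (pars, model) shape used by updatefn above
updatefn_unconstrained <- function(pars, model) {
  p <- to_natural(pars)
  model$H[1] <- p[["H"]]
  model$T[1] <- p[["phi"]]
  model$Q[1] <- p[["Q"]]
  model
}

to_natural(c(0, 0, 0))
```

It could then be passed along the lines of fitSSM(ss_model, c(0, 0.5, 0), updatefn_unconstrained, method = "BFGS"), with the estimates on the natural scale recovered by applying to_natural() to the optimiser's output.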