Friedman's test manual calculation - r

I have obtained a negative value from the Friedman's test. The data is:
Full MIC ReliefF LCorrel InfoGain
equinox 69.939 80.178 78.794 75.205 62.268
lucene 78.175 84.103 79.017 82.044 75.564
mylyn 75.531 78.006 77.161 47.711 81.575
pde 70.282 82.686 81.884 75.07 79.476
jdt 71.675 93.202 95.387 85.878 82.818
Ranking is below
Full MIC ReliefF LCorrel InfoGain
equinox 2 5 4 3 1
lucene 2 5 3 4 1
mylyn 2 4 3 1 5
pde 1 5 4 2 3
jdt 1 4 5 3 2
Sum 8 23 19 13 12
The Friedman's F Calculation formula:
F = (5/[5*5*(5+1)] * [8*8 + 23*23 + 19*19 + 13*13 + 12*12] - [5*5*(5+1)]
The value I get is -107.7666667.
How do I interpret that? The examples I have seen all have positive result.
I know about the R code but want the manual calculation.

This is how I generated the results and it worked
pacc_part
f1 <- friedman.test(pacc_part)
print (f1)
# Post-hoc tests are conducted only if omnimus Kruskal-Wallis test p-value
is 0.05 or less.
if ( f1$p.value < 0.05 )
{
n1 <- posthoc.friedman.nemenyi.test(pacc_part)
}
n1;
# alternate representation of post-hoc test results
summary(n1);

Related

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/cluster standard errors after using mlogit() to fit a Multinomial Logit (MNL) in a Discrete Choice problem. Unfortunately, I suspect I am having problems with it because I am using data in long format (this is a must in my case), and getting the error #Error in ef/X : non-conformable arrays after sandwich::vcovHC( , "HC0").
The Data
For illustration, please gently consider the following data. It represents data from 5 individuals (id_ind ) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is represented by two generic attributes (x1 and x2), and the choices are registered in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit an MNL using mlogit() and extract their robust variance-covariance as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2|0 ,
method ="nr",
data = df,
idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable. Where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the mirror on GitHub I spot the problem which comes from the fact that, given that the data is in long format, ef has dimensions 15 x 2 and X has 45 x 2.
My workaround
Given that the show must continue, I am computing the robust and cluster standard errors manually using some functions that I borrow from sandwich and I adjusted to accommodate the Stata's output.
> Robust Standard Errors
These lines are inspired on the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions is highly likely that there is some degree of correlation among individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction in this case and I show the equivalence with the Stata output of clogit , cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice,)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can just be applied in models with a single linear predictor where the score function aka estimating function is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC also pointed to this but did not explain it very well. I have improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html now:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, divides by 1/(n-1) in logit models, see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using specifically one or the other adjustment. And already in moderately large samples, this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to consider that the cluster variable just needs to be provided once (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

Error message in Firth's logistic regression

I have produced a logistic regression model in R using the logistf function from the logistf package due to quasi-complete separation. I get the error message:
Error in solve.default(object$var[2:(object$df + 1), 2:(object$df + 1)]) :
system is computationally singular: reciprocal condition number = 3.39158e-17
The data is structured as shown below, though a lot of the data has been cut here. Numbers represent levels (i.e 1 = very low, 5 = very high) not count data. Variables OrdA to OrdH are ordered factors. The variable Binary is a factor.
OrdA OrdB OrdC OrdE OrdF OrdG OrdH Binary
1 3 4 1 1 2 1 1
2 3 4 5 1 3 1 1
1 3 2 5 2 4 1 0
1 1 1 1 3 1 2 0
3 2 2 2 1 1 1 0
I have read here that this can be caused by multicollinearity, but have tested this and it is not the problem.
VIFModel <- lm(Binary ~ OrdA + OrdB + OrdC + OrdD + OrdE +
OrdF + OrdG + OrdH, data = VIFdata)
vif(VIFModel)
GVIF Df GVIF^(1/(2*Df))
OrdA 6.09 3 1.35
OrdB 3.50 2 1.37
OrdC 7.09 3 1.38
OrdD 6.07 2 1.57
OrdE 5.48 4 1.23
OrdF 3.05 2 1.32
OrdG 5.41 4 1.23
OrdH 3.03 2 1.31
The post also indicates that the problem can be caused by having "more variables than observations." However, I have 8 independent variables and 82 observations.
For context each independent variable is ordinal with 5 levels, and the binary dependent variable has 30% of the observations with "successes." I'm not sure if this could be associated with the issue. How do I fix this issue?
X <- model.matrix(Binary ~ OrdA+OrdB+OrdC+OrdD+OrdE+OrdF+OrdG+OrdH,
Data3, family = "binomial"); dim(X); Matrix::rankMatrix(X)
[1] 82 24
[1] 23
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.820766e-14
Short answer: your ordinal input variables are transformed to 24 predictor variables (number of columns of the model matrix), but the rank of your model matrix is only 23, so you do indeed have multicollinearity in your predictor variables. I don't know what vif is doing ...
You can use svd(X) to help figure out which components are collinear ...

How to include the interaction between a covariate and time for a non-proportional hazards model?

How to include the interaction between a covariate and and time for a non-proportional hazards model?
I often find that the proportional hazards assumption for the Cox regressions doesn’t hold.
Take the following data as an example.
> head(data2)
no np_p age_dx1 race1 mr_dx er_1 pr_1 sct_1 surv_mo km_stts1
1 20 1 2 4 1 2 2 4 52 1
2 33 1 3 1 2 1 2 1 11 1
3 67 1 2 4 4 1 1 3 20 1
4 90 1 3 1 3 3 3 2 11 1
5 143 1 2 4 3 1 1 2 123 0
6 180 1 3 1 3 1 1 2 9 1
First, I fitted a Cox regression model.
> fit2 <- coxph(Surv(surv_mo, km_stts1) ~ np_p + age_dx1 + race1 + mr_dx + er_1 + pr_1 + sct_1, data = data)
Second, I assessed the proportional hazards assumption.
> check_PH2 <- cox.zph(fit2, transform = "km")
> check_PH2
rho chisq p
np_p 0.00946 0.0748 7.84e-01
age_dx1 -0.00889 0.0640 8.00e-01
race1 -0.03148 0.7827 3.76e-01
mr_dx -0.03120 0.7607 3.83e-01
er_1 -0.14741 18.5972 1.61e-05
pr_1 0.05906 2.9330 8.68e-02
sct_1 0.17651 23.8030 1.07e-06
GLOBAL NA 53.2844 3.26e-09
So, this means that the hazard function of er_1 and sct_1 were nonproportional over time (Right?).
In my opinion, I can include the interaction between these two covariates and time seperately in the model. But I don't know how to perform it using R.
Thank you.

Error in fisher.test : Bug in fexact3, it[i=6]=0: negative key (kyy=91)

I have this table abd I want to analyse it statistically.
table(sci$category, sci$true_group)
mono sim_rus_nen suc_balanced suc_nen_rus suc_rus_nen
generalization 9 3 9 4 3
description 35 16 15 13 17
scheme 2 1 1 1 2
syncretism 5 3 7 16 2
tautology 2 2 2 3 3
substitution 1 0 0 0 0
indefinite 7 5 5 6 9
no_answer 30 17 18 13 19
So I decided to apply Fisher's exact test. But I have this error (although it's OK with chiq.square)
fisher.test(table(sci$category, sci$true_group))
Error in fisher.test(my_tab) : Bug in fexact3, it[i=6]=0: negative
key -1099365618 (kyy=91)
How can I fix this?
For larger contingency table/ counts, it gets resource intensive, to count all the worst cases to arrive at the p-value (seems to be that error).
So its convenient to simulate the p-values for tables larger than (2x2):
df <- table(sci$category, sci$true_group)
fisher.test(df, simulate.p.value = TRUE, B = 1e6)
Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06 replicates)
data: df
p-value = 0.1054
alternative hypothesis: two.sided
PS: Choosing between Fishers exact test vs Chisq-test is a whole another discussion. I would refer you to this cross validated post for clarity: Alternatives to chisq-test

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characters by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost based on the current number of activities that are associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create a dirichlet distribution for wins and losses, so two separate distributions, one for wins and one for losses. Using the counts/proportions of events for each outcome as the priors. I guess I'm not sure how to do this in R. I think my course of action would be estimate a dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is once I have this distribution, it's unclear to me how to get a posterior distribution of a new observation. I've read several papers and can't seem to find a straightforward process on how to do this. (Or maybe there's some holes in my understanding). Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571 )
alpha <- probs
alpha <- matrix( alpha , nrow=N , ncol=length(alpha) , byrow=TRUE )
x <- dirichlet.simul( alpha )
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383,0.308411215,0.163551402 )
alpha2 <- probs2
alpha2 <- matrix( alpha2 , nrow=N , ncol=length(alpha2) , byrow=TRUE )
x2 <- dirichlet.simul( alpha2 )
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.

Resources