I have 4 raters who have rated 10 subjects. Because I have multiple raters (and in my actual dataset, these 4 raters have rated the 10 subjects on multiple variables), I've chosen to use Light's kappa to calculate inter-rater reliability. I've run the Light's kappa code shown below and included an example of my data.
My question is why the resulting kappa value (kappa = 0.545) is fairly low even though the raters agree on almost all ratings. Is there some other way to calculate inter-rater reliability (e.g., from pairwise combinations of raters)?
Any help is appreciated.
# Light's kappa with the irr package
library(irr)

subjectID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
rater1 <- c(3, 2, 3, 2, 2, 2, 2, 2, 2, 2)
rater2 <- c(3, 2, 3, 2, 2, 2, 2, 2, 2, 2)
rater3 <- c(3, 2, 3, 2, 2, 2, 2, 2, 2, 2)
rater4 <- c(3, 2, 1, 2, 2, 2, 2, 2, 2, 2)
df <- data.frame(subjectID, rater1, rater2, rater3, rater4)

# kappam.light() treats every column as a rater, so drop subjectID;
# including it as a fifth "rater" drags the kappa down
kappam.light(df[, -1])
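For reference, here is a sketch of the pairwise idea mentioned above: Light's kappa is the mean of Cohen's kappa over all rater pairs, so the per-pair values can be inspected directly (using irr::kappa2 on the data frame built above).

# Pairwise Cohen's kappa for every pair of raters, then their mean
pairs <- combn(c("rater1", "rater2", "rater3", "rater4"), 2)
pairwise_kappas <- apply(pairs, 2, function(p) irr::kappa2(df[, p])$value)
data.frame(rater_a = pairs[1, ], rater_b = pairs[2, ], kappa = pairwise_kappas)
mean(pairwise_kappas)  # this mean is what kappam.light() reports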
I've been trying to estimate a panel VAR (pVAR) by GMM in R using the panelvar package. I'm estimating a dynamic panel VAR with two-step GMM in first differences.
I have a balanced panel of 378 observations with a group variable (id) and a time variable (year): 14 observations per group and 27 groups in total. In total, I have 120 instruments. I'm a bit concerned about the results of the Hansen J-test and I'm looking for an explanation: the Hansen J-test statistic is 0 with a p-value of 1. To my understanding, failing to reject would mean the overidentifying restrictions are valid, but a p-value of exactly 1.000 makes me suspect something deeper is going on.
In my estimation I have 7 dependent variables and 2 exogenous variables, and I'm using 4 lagged instruments per dependent variable. Why is the p-value of the Hansen test so high?
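The call I'm running looks roughly like this (variable and data names are placeholders, not my actual columns; the lag order shown is illustrative):

library(panelvar)
mod <- pvargmm(
  dependent_vars = c("y1", "y2", "y3", "y4", "y5", "y6", "y7"),  # 7 endogenous variables
  lags = 1,                              # placeholder VAR lag order
  exog_vars = c("x1", "x2"),             # 2 exogenous variables
  transformation = "fd",                 # first differences
  data = mydata,                         # placeholder panel data frame
  panel_identifier = c("id", "year"),
  steps = "twostep",
  system_instruments = FALSE,
  max_instr_dependent_vars = 4,          # 4 instrument lags per dependent variable
  collapse = FALSE
)
summary(mod)   # the summary reports the Hansen J-test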
Thanks in advance!
I have 5 raters who have rated 10 subjects. I've chosen to use Light's kappa to calculate inter-rater reliability because I have multiple raters. My issue is that when there is strong agreement between the raters, Light's kappa cannot be calculated due to lack of variability, and I've followed the updated post here which suggests using the raters package in R when there is strong agreement.
The problem is that the raters package calculates Fleiss' kappa, which, from my understanding, is not suitable for inter-rater reliability where the same raters rate all subjects (as in my case). My question is what type of kappa statistic I should calculate in cases where there is strong agreement.
#install.packages("irr")
library(irr)
#install.packages('raters')
library(raters)
#mock dataset
rater1<- c(1,1,1,1,1,1,1,1,0,1)
rater2<- c(1,1,1,1,1,1,1,1,1,1)
rater3<- c(1,1,0,1,1,0,1,1,1,1)
rater4<- c(1,1,1,1,1,1,1,1,1,1)
rater5<- c(1,1,1,1,1,1,0,1,1,1)
df <- data.frame(rater1, rater2, rater3, rater4, rater5)
#light's kappa
kappam.light(df)
#kappa using raters package
data(df)
concordance(df, test = 'Normal')
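One alternative I've seen suggested for near-perfect agreement is Gwet's AC1, which is less sensitive to the "kappa paradox" than kappa-type statistics. A sketch using the irrCAC package (assuming that package is appropriate here), applied to the same subjects-by-raters data frame:

# install.packages("irrCAC")
library(irrCAC)
gwet.ac1.raw(df)$est   # df is the subjects-by-raters data frame built above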
For more context, the raters evaluate each target on multiple dimensions, some of which are categorical and others ordinal. We have at least 2 raters (possibly more) evaluating these targets, and we want to see whether the raters are reliable.
Is there an R package that would help me analyze inter-rater reliability on a multi-question assessment?
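For concreteness, here is a sketch of the kind of per-question loop I have in mind using the irr package (the objects ratings_by_question and scale_type are placeholders, not real data):

library(irr)
# ratings_by_question: hypothetical named list, one subjects-by-raters data frame per question
# scale_type: hypothetical named list giving "nominal" or "ordinal" per question
irr_per_question <- lapply(names(ratings_by_question), function(q) {
  r <- ratings_by_question[[q]]
  if (scale_type[[q]] == "ordinal" && ncol(r) == 2) {
    kappa2(r, weight = "squared")$value          # weighted kappa for 2 ordinal raters
  } else if (scale_type[[q]] == "ordinal") {
    icc(r, model = "twoway", type = "agreement")$value  # ICC for >2 ordinal raters
  } else {
    kappam.light(r)$value                        # Light's kappa for nominal ratings
  }
})
names(irr_per_question) <- names(ratings_by_question)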
I am conducting a propensity score matched analysis of the outcomes of two different new cancer treatments, where the outcome is binary (cancer-free or not cancer-free). Following successful matching, I get the paired 2x2 contingency table for the outcome between my matched pairs, shown below:
                                      **Treatment 1**
                              Not Cancer-Free   Cancer-Free
**Treatment 2**  Not Cancer-Free           50            39
                 Cancer-Free               53            60
I'd like to compare the outcomes to figure out whether one treatment is better than the other by comparing the odds of being cancer-free. I've been advised to conduct McNemar's test because of the matched nature of the data, which I did, getting a p-value of 0.17 (non-significant). However, I've also been advised that instead of simply using the odds ratio normally used for such 2x2 tables (b/c = 39/53 ≈ 0.74), I should calculate the odds ratio and 95% confidence interval using the methods in Agresti A, Min Y. Effects and non-effects of paired identical observations in comparing proportions with binary matched-pairs data. Statistics in Medicine. 2004;23(1):65-75, as these account for the matched nature of the data.
Unfortunately, after reading this paper numerous times (especially its odds ratio section), I can't figure out which equations for the odds ratio and 95% CI they are referring to. I know they must be in there somewhere, as other papers cite this paper for their odds ratios without sharing their methodology, which makes it difficult to trace back.
If anyone has read this paper or has experience with odds ratios for matched binary data, could you please let me know how to obtain matched-pair odds ratios? Thank you incredibly much in advance!
You can use the exact McNemar test for paired data. A point the paper makes, and what the exact test uses, is that only the off-diagonal (discordant) counts b and c enter the calculation. You can use the exact2x2 package (https://cran.r-project.org/web/packages/exact2x2/exact2x2.pdf) to get the test results with a 95% CI:
library(exact2x2)

# Set up the data as a 2x2 matrix (filled column-wise)
x <- matrix(c(50, 53, 39, 60), nrow = 2, ncol = 2)
mcnemar.exact(x)
Gives:
Exact McNemar test (with central confidence intervals)
data: x
b = 39, c = 53, p-value = 0.175
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.4738071 1.1339142
sample estimates:
odds ratio
0.7358491
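For intuition, the central interval above can essentially be reproduced by hand from the discordant counts alone: the matched-pairs (conditional) odds ratio is b/c, and an exact CI follows from the Clopper-Pearson interval for the binomial proportion p = b/(b + c), since OR = p/(1 - p). A short sketch:

b <- 39
c_ <- 53                                  # avoid masking base::c
or_hat <- b / c_                          # conditional ML estimate, 0.7358...
p_ci <- binom.test(b, b + c_)$conf.int    # exact CI for b/(b + c)
or_ci <- p_ci / (1 - p_ci)                # transform to the odds-ratio scale
c(estimate = or_hat, lower = or_ci[1], upper = or_ci[2])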
I would like to perform a bootstrapped paired t-test in R. I have tried this on multiple datasets that returned p < .05 with a parametric paired t-test; however, when I run the bootstrap I get p-values between 0.4 and 0.5. Am I running this incorrectly?
# groupA and groupB are the paired measurements
differences <- groupA - groupB
t.test(differences)  # to get the original t-statistic, e.g. 1.96

Repnumber <- 10000
tstat.values <- numeric(Repnumber)
for (i in 1:Repnumber) {
  group1 <- sample(differences, size = length(differences), replace = TRUE)
  tstat.values[i] <- t.test(group1)$statistic
}

# To get the bootstrap p-value, compare the number of tstat.values greater
# (or lesser) than or equal to the original t-statistic, divided by the
# number of reps:
sum(tstat.values <= -1.96) / Repnumber
Thank you!
It looks like you're comparing apples and oranges. The single t-test of differences gives you a t-statistic which, if greater than a critical value, indicates that the mean difference between groupA and groupB is significantly different from zero. Your bootstrapping code does the same thing, but for 10,000 bootstrapped samples of differences, giving you an estimate of the variation in the t-statistic across random samples from the population of differences. If you take the mean of these bootstrapped t-statistics (mean(tstat.values)), you'll see it's about the same as the single t-statistic from the full sample of differences.
sum(tstat.values <= -1.96)/Repnumber gives you the proportion of bootstrapped t-statistics less than or equal to -1.96. This estimates how often you would get a t-statistic below -1.96 in repeated random samples from your population; it is not a p-value. I think it is essentially an estimate of the power of your test to detect a difference of the observed size for a given sample size and significance level, though I'm not sure how robust such a power analysis is.
To properly bootstrap the t-test, I think what you actually need is some kind of permutation test that checks whether your actual data are an outlier compared with repeatedly shuffling the labels on your data and running a t-test on each shuffled dataset. You might want to ask a question on CrossValidated to get advice on how to do this properly for your data. These CrossValidated answers might help: here, here, and here.
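As a sketch of what such a resampling test under the null might look like for paired data (a sign-flip permutation test on the differences; this is one common approach, not necessarily what the linked answers describe):

# `differences` is the vector of paired differences from the question
set.seed(1)
obs_t <- t.test(differences)$statistic
R <- 10000
perm_t <- replicate(R, {
  # under the null, each paired difference is equally likely to be + or -
  flipped <- differences * sample(c(-1, 1), length(differences), replace = TRUE)
  t.test(flipped)$statistic
})
# two-sided p-value: proportion of sign-flipped t-statistics at least as extreme
mean(abs(perm_t) >= abs(obs_t))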