I am trying to calculate Cohen's kappa values for multiple teacher-segment permutations. In this example, there are six unique teacher-segment combinations. For example, teacher1-segment1 has two different raters, and I would like to see the ICC of these two raters for teacher1-segment1 (and for all the other teacher-segment permutations).
I have a data set such as this.
full.data <- read_table2('Rater teacher segment subject1 subject2 subject3
A 1 1 1 4 1
B 1 1 3 4 3
B 2 2 2 3 2
C 2 2 1 4 1
D 3 1 4 4 4
E 3 1 4 3 4
D 4 2 3 3 3
A 4 2 4 3 4
B 5 2 4 3 4
A 5 2 5 3 5
D 6 1 5 3 5
E 6 1 5 3 5')
I know that if I wanted to get Cohen's kappa for just one teacher-segment group, I would transform the data like this,
one.permuation<- read_table2('Rater RaterA-teacher1-segment1 RaterB-teacher1-segment1
subject1 1 3
subject2 4 4
subject3 1 3')
and then run,
library(irr)
# drop the first column (subject labels); only the two rater columns are ratings
print(icc(one.permuation[, -1], model = "twoway", type = "consistency", unit = "average"))
This would give me just ONE kappa value for that particular teacher-segment.
How would I get the values for all the teacher-segment permutations at once (each teacher-segment group has a different pair of observers)?
How do I present these 6 different kappa values in a way that makes sense? I've never done something like this before; hoping to get some insight from experienced stat folks.
Although not shown here, raters give both an ordinal response (a 1-4 score) and a nominal response (Yes/No). Should I be using a different kappa function for these different kinds of scales? From the "psych" package documentation: "Cohen's kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) may be used to find the agreement of two raters when using nominal scores."
Here is what I tried for you. You said that you want to calculate Cohen's kappa values, so I decided to use cohen.kappa() in the psych package rather than icc(), which I am not familiar with. I hope you do not mind that. The key thing was to transform your data so that you can run cohen.kappa() on all groups together. Seeing your one.permuation, I tried to create a data frame that has teacher, segment, subject, and raters (A, B, C, D, and E) as columns. pivot_longer() and pivot_wider() handled this. Then I needed to move the numeric values into two columns (row-wise value sorting). For this I used naLast() from Ananda Mahto's SOfun package. (Ananda is the author of the splitstackshape package.) Then I grouped the data by teacher and segment and created a list column. For each list element that contains a data frame, I converted the data frame to a matrix, applied cohen.kappa(), and obtained the results with tidy(). Finally, I used unnest() to see the results.
library(tidyverse)
library(psych)
library(devtools)
install_github("mrdwab/SOfun")
library(SOfun)
library(broom)
pivot_longer(full.data, cols = subject1:subject3,
names_to = "subject", values_to = "rating_score") %>%
pivot_wider(id_cols = c("teacher", "segment", "subject"),
names_from = "Rater", values_from = "rating_score") %>%
as.matrix %>%
naLast(by = "row") %>%
as_tibble %>%
select(-c(subject, C:E)) %>%
type_convert() %>%
group_by(teacher, segment) %>%
nest() %>%
mutate(result = map(.x = data,
.f = function(x) cohen.kappa(as.matrix(x)) %>% tidy())) %>%
unnest(result)
# teacher segment data type estimate conf.low conf.high
# <dbl> <dbl> <list<df[,2]>> <chr> <dbl> <dbl> <dbl>
# 1 1 1 [3 x 2] unweighted 0.25 -0.0501 0.550
# 2 1 1 [3 x 2] weighted 0.571 -0.544 1
# 3 2 2 [3 x 2] unweighted 0 0 0
# 4 2 2 [3 x 2] weighted 0.571 -1 1
# 5 3 1 [3 x 2] unweighted 0 0 0
# 6 3 1 [3 x 2] weighted 0 0 0
# 7 4 2 [3 x 2] unweighted 0 0 0
# 8 4 2 [3 x 2] weighted 0 0 0
# 9 5 2 [3 x 2] unweighted 0.25 -0.0501 0.550
#10 5 2 [3 x 2] weighted 0.571 -0.544 1
#11 6 1 [3 x 2] unweighted 1 1 1
#12 6 1 [3 x 2] weighted 1 1 1
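As a cross-check on the unweighted rows above, kappa can also be computed by hand in base R from the observed agreement p_o and the chance agreement p_e, as (p_o - p_e) / (1 - p_e). This is only a sketch: cohen_kappa_unweighted() is a hypothetical helper written for this illustration, not a function from psych or irr, and it treats the scores as nominal categories. For an ordinal 1-4 score, the weighted rows are probably the more relevant ones.

```r
# Same example data as in the question, read with base R only
full.data <- read.table(text = "Rater teacher segment subject1 subject2 subject3
A 1 1 1 4 1
B 1 1 3 4 3
B 2 2 2 3 2
C 2 2 1 4 1
D 3 1 4 4 4
E 3 1 4 3 4
D 4 2 3 3 3
A 4 2 4 3 4
B 5 2 4 3 4
A 5 2 5 3 5
D 6 1 5 3 5
E 6 1 5 3 5", header = TRUE)

# Hypothetical helper: unweighted Cohen's kappa for two rating vectors.
# Returns NaN when chance agreement is 1 (both raters use one identical category).
cohen_kappa_unweighted <- function(r1, r2) {
  lev <- sort(union(r1, r2))                          # all categories used by either rater
  tab <- table(factor(r1, lev), factor(r2, lev)) / length(r1)
  po  <- sum(diag(tab))                               # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab))             # agreement expected by chance
  (po - pe) / (1 - pe)
}

# One group per teacher-segment combination; each group holds two rater rows
groups <- split(full.data, list(full.data$teacher, full.data$segment), drop = TRUE)

kappas <- sapply(groups, function(g) {
  scores <- as.matrix(g[, c("subject1", "subject2", "subject3")])
  cohen_kappa_unweighted(scores[1, ], scores[2, ])
})

# Names have the form "teacher.segment", e.g. "1.1"
print(round(kappas, 3))
```

For teacher 1, segment 1 the raters agree on one of three subjects (p_o = 1/3) and p_e = 1/9, which gives 0.25, the same value as the unweighted estimate in the table.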
icc version
The data transformation is basically the same. You need to work a bit more when you run multiple stats: icc() returns an icclist object, and you want to create data frames from those objects.
library(irr)
pivot_longer(full.data, cols = subject1:subject3,
names_to = "subject", values_to = "rating_score") %>%
pivot_wider(id_cols = c("teacher", "segment", "subject"),
names_from = "Rater", values_from = "rating_score") %>%
as.matrix %>%
naLast(by = "row") %>%
as_tibble %>%
select(-c(subject, C:E)) %>%
mutate_at(vars(A:B), .funs = list(~as.numeric(.))) %>%
group_by(teacher, segment) %>%
nest() %>%
mutate(result = map(.x = data,
.f = function(x) enframe(unlist(icc(x,
model = "twoway",
type = "consistency",
unit = "average"))) %>%
pivot_wider(names_from = "name",
values_from = "value"))) %>%
unnest(result)
teacher segment data subjects raters model type unit icc.name value r0 Fvalue df1 df2 p.value conf.level lbound ubound
<chr> <chr> <list<d> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) 0.75 0 4 2 2 0.2 0.95 -8.74~ 0.993~
2 2 2 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) 0.75 0 4 2 2 0.2 0.95 -8.75 0.993~
3 3 1 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) 4.99~ 0 1 2 2 0.5 0.95 -38 0.974~
4 4 2 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) -8.3~ 0 0.999~ 2 2 0.5 0.95 -38 0.974~
5 5 2 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) 0.88~ 0 8.999~ 2 2 0.1 0.95 -3.33~ 0.997~
6 6 1 [3 x 2] 3 2 twow~ cons~ aver~ ICC(C,2) 1 0 Inf 2 2 0 0.95 1 1
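If installing SOfun from GitHub is not an option, the "two rater columns" layout can probably be reached with dplyr and tidyr alone. The sketch below numbers the raters within each teacher-segment-subject cell instead of sorting NA values to the end of each row. The column names rater1 and rater2 are made up for this illustration, and the slot a rater lands in follows the row order of the data rather than the alphabetical order of rater names.

```r
library(dplyr)
library(tidyr)

full.data <- read.table(text = "Rater teacher segment subject1 subject2 subject3
A 1 1 1 4 1
B 1 1 3 4 3
B 2 2 2 3 2
C 2 2 1 4 1
D 3 1 4 4 4
E 3 1 4 3 4
D 4 2 3 3 3
A 4 2 4 3 4
B 5 2 4 3 4
A 5 2 5 3 5
D 6 1 5 3 5
E 6 1 5 3 5", header = TRUE)

wide <- full.data %>%
  pivot_longer(cols = subject1:subject3,
               names_to = "subject", values_to = "rating_score") %>%
  group_by(teacher, segment, subject) %>%
  # first rater seen in a cell goes to rater1, the second to rater2
  mutate(slot = paste0("rater", row_number())) %>%
  ungroup() %>%
  select(-Rater) %>%
  pivot_wider(names_from = slot, values_from = rating_score)

print(wide)
```

The result has one row per teacher, segment and subject, so it can be passed to group_by(teacher, segment) and nest() in the same way as in the pipelines above.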
This is an example dataframe
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
I have column names in my dataframe in R in this format
\S+_T\d+|\S+_T\d+
The syntax is something like (Name)_ (T)(Number) | (Name)_ (T)(Number)
Step 1) I want to select columns which contain the same (T)(Number) on both sides of the "|"
I did this with some manual labor :
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1) %>% full_join(means_t5)
This gives me what I want and it was easy to do because I only had 3 types - T0, T1 and T5. What do I do if I had a huge number?
Step 2) From the output of Step1, I want to do a negation of the last question i.e. select only those columns with Names which are not the same
For example B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained
Is there a way to capture the part in front of the pipe (|) with a regex and match it against the part after the pipe?
Thank you
If you have that much information in your column names, I like to transform the data into the long format and then separate the info from the column name into several columns. Then it's easy to filter by these columns:
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)
library(tidyr)
library(dplyr)
library(stringr)
res <- means2 %>%
pivot_longer(
cols = -id_cp_interaction,
names_to = "names",
values_to = "values"
) %>%
mutate(
celltype_1 = str_extract(names, "^[^_]*"),
timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"), # digits right before the literal "|"
celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
timepoint_2 = str_extract(names, "[0-9]+$") # trailing digits, including multi-digit timepoints
)
head(res, n = 7)
#> # A tibble: 7 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|B… 1.68 B 0 B 0
#> 2 1 B_T0|B… 19.3 B 0 B 1
#> 3 1 B_T0|F… 10.6 B 0 Fibro 0
#> 4 1 B_T5|E… 12.5 B 5 Endo 5
#> 5 1 Macro_… 2.84 Macro 1 Fibro 1
#> 6 2 B_T0|B… 2.17 B 0 B 0
#> 7 2 B_T0|B… 10.1 B 0 B 1
# only keep interactions of different cell types
res %>%
filter(celltype_1 != celltype_2) %>%
head()
#> # A tibble: 6 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|F… 10.6 B 0 Fibro 0
#> 2 1 B_T5|E… 12.5 B 5 Endo 5
#> 3 1 Macro_… 2.84 Macro 1 Fibro 1
#> 4 2 B_T0|F… 1.47 B 0 Fibro 0
#> 5 2 B_T5|E… 11.3 B 5 Endo 5
#> 6 2 Macro_… 13.0 Macro 1 Fibro 1
Created on 2022-09-19 by the reprex package (v1.0.0)
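To answer the regex part of the question directly: a backreference can compare the two sides of the pipe without reshaping. The sketch below uses base R grepl() with perl = TRUE. The two patterns are my own illustration and assume that every column name has exactly the form Name_T<number>|Name_T<number>.

```r
set.seed(1)
means2 <- as.data.frame(matrix(runif(n = 25, min = 1, max = 20), nrow = 5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0",
                   "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
nm <- names(means2)

# Step 1: \\1 must repeat the timepoint captured on the left side of "|"
same_time <- grepl("^[^|]+_(T\\d+)\\|[^|]+_\\1$", nm, perl = TRUE)

# Step 2: \\1 must repeat the name captured on the left side of "|"
same_name <- grepl("^([^|]+)_T\\d+\\|\\1_T\\d+$", nm, perl = TRUE)

# Keep columns with the same timepoint but different names
keep <- nm[same_time & !same_name]
print(keep)

selected <- means2[, keep, drop = FALSE]
```

Because the timepoint is captured rather than hard-coded, the same two lines work for any number of T values.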
I am exploring the tidyverse packages, so I am interested in how to get the following task done in the tidy way. One can easily circumvent the problem using *apply functions.
Consider the following data
tb <-
lapply(matrix(c("a", "b", "c")), function(x)
rep(x, 3)) %>% unlist %>% c(rep(c(1, 2, 3), 6)) %>% matrix(ncol = 3) %>%
as_tibble(.name_repair = ~ c("tag", "x1", "x2")) %>% type.convert()
# A tibble: 9 x 3
tag x1 x2
<fct> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 1 1
5 b 2 2
6 b 3 3
7 c 1 1
8 c 2 2
9 c 3 3
I group them using nest() function and for each group I want to apply a different function from a list of functions f_1, f_2, f_3
f_1 <- function(x)
x[,1] + x[,2]
f_2 <- function(x)
x[,1] - x[,2]
f_3 <- function(x)
x[,1] * x[,2]
tb_func_attached <-
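# f_0 does not exist; the functions defined above are f_1, f_2 and f_3.
# ungroup() keeps mutate() from running per group on newer tidyr versions.
tb %>% group_by(tag) %>% nest() %>% ungroup() %>% mutate(func = c(f_1, f_2, f_3))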
# A tibble: 3 x 3
tag data func
<fct> <list> <list>
1 a <tibble [3 x 2]> <fn>
2 b <tibble [3 x 2]> <fn>
3 c <tibble [3 x 2]> <fn>
I try to use invoke_map to apply the functions
tb_func_attached %>% {invoke_map(.$func, .$data)}
invoke_map(tb_func_attached$func, tb_func_attached$data)
But I get the error Error in (function (x) : unused arguments (x1 = 1:3, x2 = 1:3), while the following code runs
> tb_func_attached$func[[1]](tb_func_attached$data[[1]])
x1
1 2
2 4
3 6
> tb_func_attached$func[[2]](tb_func_attached$data[[2]])
x1
1 0
2 0
3 0
> tb_func_attached$func[[3]](tb_func_attached$data[[3]])
x1
1 1
2 4
3 9
But invoke_map still does not work.
So the question is: given nested data tb_func_attached, how do I apply the functions in tb_func_attached$func row by row to tb_func_attached$data?
And a side question: what is the reason for the retirement of invoke_map? It fits quite well with the concept of vectorisation, IMHO.
Update:
The previous version dealt with single-column data (tb had only the tag and x1 columns), and @A. Suliman's comment provides a solution for that case.
However, when the data column in the nested tibble has more than one column, the code stops running again.
Use map2 to iterate over the list of functions first, and over the data column second. Like this:
tb_func_attached %>%
mutate(output = map2(func, data, ~ .x(.y))) %>%
unnest(data, output)
The output looks this way:
# A tibble: 9 x 4
tag x1 x2 x11
<fct> <int> <int> <int>
1 a 1 1 2
2 a 2 2 4
3 a 3 3 6
4 b 1 1 0
5 b 2 2 0
6 b 3 3 0
7 c 1 1 1
8 c 2 2 4
9 c 3 3 9
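A likely explanation for the original error is that invoke_map() treats each element of its second argument as a list of arguments, so each nested tibble is spliced column by column into the call, much like do.call() does. The sketch below shows that effect in base R and then repeats the map2() approach with the current tidyr syntax, unnest(c(data, output)). The helpers g_add, g_sub and g_mul are hypothetical variants of f_1 to f_3 that return plain vectors, which avoids a name clash between the data and output columns.

```r
library(dplyr)
library(tidyr)
library(purrr)

tb <- tibble(tag = rep(c("a", "b", "c"), each = 3),
             x1  = rep(1:3, 3),
             x2  = rep(1:3, 3))

# Hypothetical vector-returning versions of f_1, f_2, f_3
g_add <- function(d) d[[1]] + d[[2]]
g_sub <- function(d) d[[1]] - d[[2]]
g_mul <- function(d) d[[1]] * d[[2]]

# A data frame is a list, so do.call() passes x1 and x2 as two named
# arguments instead of one data frame, and the call fails
err <- tryCatch(do.call(g_add, tb[, c("x1", "x2")]),
                error = function(e) conditionMessage(e))
print(err)

res <- tb %>%
  nest(data = c(x1, x2)) %>%                  # one row per tag
  mutate(func   = list(g_add, g_sub, g_mul),
         output = map2(func, data, function(f, d) f(d))) %>%
  select(-func) %>%
  unnest(c(data, output))                     # back to one row per observation

print(res)
```

Passing the whole tibble as a single argument, as map2() does here, is what the manual calls tb_func_attached$func[[1]](tb_func_attached$data[[1]]) were already doing.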
I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The below does, but is clunky with an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2), # tibble() replaces the deprecated data_frame()
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
maybe related to Weird group_by + mutate + which.max behavior
You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
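For context on why the ifelse() attempt in the question fails: ifelse() returns a result with the same length as its test, and score[1] > score[2] has length one, so only the first element of c(1, 0) or c(0, 1) is returned and then recycled across the group. The sketch below shows this and one more variant that needs no intermediate data frame. The scores are copied (rounded) from the printed output in the question rather than regenerated with rnorm().

```r
library(dplyr)

# The test has length 1, so only the first element of `yes` comes back
print(ifelse(TRUE, c(1, 0), c(0, 1)))

dat <- tibble(pair  = rep(1:3, each = 2),
              unit  = rep(1:2, 3),
              score = c(-1.40, 0.523, 0.142, -0.847, -0.412, -1.47))

res <- dat %>%
  group_by(pair) %>%
  mutate(winner = unit[which.max(score)],        # unit with the highest score in the pair
         won    = as.integer(unit == winner)) %>%
  ungroup()

print(res)
```

With tied scores, which.max() picks the first unit, so ties would need a separate rule.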
I am trying to create 14 separate dataframes in order to run an ANOVA on each of them separately. (If there is a way to run 14 separate ANOVAs without creating these dataframes, please let me know.) I have a large, tidy dataframe with a column "number" that specifies the question number.
id number value
1 1 2
1 2 1
1 3 4
2 1 4
2 2 3
2 3 4
I know I could create individual dataframes one-by-one in the following way:
df1 <- filter(df, number == 1)
df2 <- filter(df, number == 2)
df3 <- filter(df, number == 3)
I am wondering if there is a more efficient way of creating these dataframes.
Using dplyr::do, we can easily do the calculation for each group. I also show the improved/standardised result if we use broom::tidy on the anova results (consistent variable names for one), see the broom and dplyr vignette for more.
library(tidyverse)
tbl <- read_table2(
"id number value
1 1 2
1 2 1
1 3 4
2 1 4
2 2 3
2 3 4"
)
tbl %>%
group_by(number) %>%
do(anova(lm(value ~ number, data = .)))
#> Warning in anova.lm(lm(value ~ number, data = .)): ANOVA F-tests on an
#> essentially perfect fit are unreliable
#> # A tibble: 3 x 6
#> # Groups: number [3]
#> number Df `Sum Sq` `Mean Sq` `F value` `Pr(>F)`
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2.00e+ 0 2.00e+ 0 NA NA
#> 2 2 1 2.00e+ 0 2.00e+ 0 NA NA
#> 3 3 1 7.89e-31 7.89e-31 NA NA
library(broom)
tbl %>%
group_by(number) %>%
do(tidy(anova(lm(value ~ number, data = .))))
#> Warning in anova.lm(lm(value ~ number, data = .)): ANOVA F-tests on an
#> essentially perfect fit are unreliable
#> # A tibble: 3 x 7
#> # Groups: number [3]
#> number term df sumsq meansq statistic p.value
#> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Residuals 1 2.00e+ 0 2.00e+ 0 NA NA
#> 2 2 Residuals 1 2.00e+ 0 2.00e+ 0 NA NA
#> 3 3 Residuals 1 7.89e-31 7.89e-31 NA NA
Created on 2018-04-26 by the reprex package (v0.2.0).
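If the separate data frames themselves are wanted, base R's split() creates all of them in one call and returns them as a named list, which can then be processed with lapply() or sapply(). The sketch below uses the six example rows. The per-question mean is only a stand-in for the ANOVA, because the example data contain no predictor column to model.

```r
df <- data.frame(id     = rep(1:2, each = 3),
                 number = rep(1:3, times = 2),
                 value  = c(2, 1, 4, 4, 3, 4))

# One data frame per question number, named "1", "2", "3"
by_number <- split(df, df$number)

# Apply any per-question computation to each piece; a real analysis would
# put a model call here, e.g. aov(value ~ <predictor>, data = d), where
# <predictor> is a column that the example data do not include
means <- sapply(by_number, function(d) mean(d$value))

print(names(by_number))
print(means)
```

by_number[["2"]] plays the role of df2 in the question, without having to create 14 separate variables.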
I am trying to calculate the Jaccard similarity between a source vector and comparison vectors in a tibble.
First, create a tibble with a names_ field (vector of strings). Using dplyr's mutate, create names_vec, a list-column, where each row is now a vector (each element of vector is a letter).
Then, create a new tibble with column jaccard_sim that is supposed to calculate the Jaccard similarity.
source_vec <- c('a', 'b', 'c')
df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
names_vec = strsplit(names_, ' '))
df_comp_jaccard <- df_comp %>%
dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/length(union(names_vec, source_vec)))
All the values in jaccard_sim are zero. However, if we run something like this, we get the correct Jaccard similarity of 0.2 for the first entry:
a <- length(intersect(source_vec, df_comp[[1,2]]))
b <- length(union(source_vec, df_comp[[1,2]]))
a/b
You could simply add rowwise
df_comp_jaccard <- df_comp %>%
rowwise() %>%
dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
length(union(names_vec, source_vec)))
# A tibble: 3 x 3
names_ names_vec jaccard_sim
<chr> <list> <dbl>
1 b d f <chr [3]> 0.2
2 u k g <chr [3]> 0.0
3 m o c <chr [3]> 0.2
Using rowwise you get the intuitive behavior some would expect when using mutate: "do this operation for every row".
Not using rowwise means you take advantage of vectorized functions, which is much faster. That is why it is the default, but it may yield unexpected results if you're not careful.
The impression that mutate (or other dplyr functions) works row-wise is an illusion that arises because you're working with vectorized functions; in fact you're always juggling full columns.
I'll illustrate with a couple of examples:
Sometimes the result is the same, with a vectorized function such as paste:
tibble(a=1:5,b=5:1) %>% mutate(X = paste(a,b,sep="_"))
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(X = paste(a,b,sep="_"))
# # A tibble: 5 x 3
# a b X
# <int> <int> <chr>
# 1 1 5 1_5
# 2 2 4 2_4
# 3 3 3 3_3
# 4 4 2 4_2
# 5 5 1 5_1
And sometimes it's different, with a function that is not vectorized, such as max:
tibble(a=1:5,b=5:1) %>% mutate(max(a,b))
# # A tibble: 5 x 3
# a b `max(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 5
# 3 3 3 5
# 4 4 2 5
# 5 5 1 5
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(max(a,b))
# # A tibble: 5 x 3
# a b `max(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 4
# 3 3 3 3
# 4 4 2 4
# 5 5 1 5
Note that in this case you shouldn't use rowwise in a real life situation, but pmax which is vectorized for this purpose:
tibble(a=1:5,b=5:1) %>% mutate(pmax(a,b))
# # A tibble: 5 x 3
# a b `pmax(a, b)`
# <int> <int> <int>
# 1 1 5 5
# 2 2 4 4
# 3 3 3 3
# 4 4 2 4
# 5 5 1 5
intersect is such a function: you fed it one list column containing vectors and one other vector, and these two objects have no intersection.
We can use map to loop through the list
library(tidyverse)
df_comp %>%
mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x,
source_vec))/length(union(.x, source_vec))))
# A tibble: 3 x 3
# names_ names_vec jaccard_sim
# <chr> <list> <dbl>
#1 b d f <chr [3]> 0.2
#2 u k g <chr [3]> 0.0
#3 m o c <chr [3]> 0.2
The map functions carry less overhead than rowwise(). Below are the system.time results for a larger dataset (300,000 rows), where map_dbl is roughly twice as fast:
df_comp1 <- df_comp[rep(1:nrow(df_comp), 1e5),]
system.time({
df_comp1 %>%
rowwise() %>%
dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/length(union(names_vec, source_vec)))
})
#user system elapsed
# 25.59 0.05 25.96
system.time({
df_comp1 %>%
mutate(jaccard_sim = map_dbl(names_vec, ~length(intersect(.x,
source_vec))/length(union(.x, source_vec))))
})
#user system elapsed
# 13.22 0.00 13.22
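The same per-element idea can be sketched in base R by putting the similarity into a small function and applying it to each element of the list with vapply(). jaccard() is a helper name chosen for this illustration, not a function from dplyr or purrr.

```r
source_vec <- c("a", "b", "c")
names_     <- c("b d f", "u k g", "m o c")
names_vec  <- strsplit(names_, " ")   # list of character vectors, one per row

# Jaccard similarity of two vectors treated as sets
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Apply to each list element in turn; numeric(1) enforces one number per element
jaccard_sim <- vapply(names_vec, jaccard, numeric(1), y = source_vec)

print(jaccard_sim)
```

The first entry shares "b" with source_vec out of five distinct letters, giving 1/5 = 0.2, which matches the manual check in the question.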