Restructuring data (for IRR analysis) in R

I have the following data frame df (fictitious data) with several variables var1, var2, ..., var_n:
var1<-c("A","A","A","B","A","C","C","A", "A", "E", "E", "B")
var2<-c(NA,"1","1","5","6","2","3","1", "1", "3", "3", "2")
id<-c(1,2,2,3,3,4,4,5,5,6,6,7)
df<-data.frame(id, var1, var2)
df
id var1 var2
1 A <NA>
2 A 1
2 A 1
3 B 5
3 A 6
4 C 2
4 C 3
5 A 1
5 A 1
6 E 3
6 E 3
7 B 2
The data are retrieved from a document analysis where several coders extracted the values from physical files. Each file has a specific id. Thus, if there are two entries with the same id, two different coders coded the same document. For example, in document no. 4 both coders agreed that var1 has the value C, whereas in document no. 3 there is a disagreement (A vs. B).
In order to calculate inter-rater reliability (IRR), I need to restructure the data frame as follows:
id var1 var1_coder2 var2 var2_coder2
2 A A 1 1
3 B A 5 6
4 C C 2 3
5 A A 1 1
6 E E 3 3
Can anyone tell me how to get this done? Thanks!

You can transform your data with functions from dplyr (group_by, mutate) and tidyr (gather, spread, unite):
library(tidyr)
library(dplyr)
new_df <- df %>%
  group_by(id) %>%
  mutate(coder = paste0("coder_", 1:n())) %>%
  gather("variables", "values", -id, -coder) %>%
  unite(column, coder, variables) %>%
  spread(column, values)
new_df
# A tibble: 7 x 5
# Groups: id [7]
# id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
# <dbl> <chr> <chr> <chr> <chr>
# 1 1 A NA NA NA
# 2 2 A 1 A 1
# 3 3 B 5 A 6
# 4 4 C 2 C 3
# 5 5 A 1 A 1
# 6 6 E 3 E 3
# 7 7 B 2 NA NA
If you only want to keep the rows where all coders have entered values, you can use filter_all:
new_df %>%
  filter_all(all_vars(!is.na(.)))
# A tibble: 5 x 5
# Groups: id [5]
# id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
# <dbl> <chr> <chr> <chr> <chr>
# 1 2 A 1 A 1
# 2 3 B 5 A 6
# 3 4 C 2 C 3
# 4 5 A 1 A 1
# 5 6 E 3 E 3
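Note that gather(), spread(), and filter_all() have since been superseded in tidyr/dplyr. Below is a sketch of the same reshaping with the newer verbs, assuming reasonably recent versions of tidyr (>= 1.0.0) and dplyr (>= 1.0.4) are installed:
library(tidyr)
library(dplyr)
new_df <- df %>%
  group_by(id) %>%
  mutate(coder = paste0("coder_", row_number())) %>%
  ungroup() %>%
  # long format: one row per id/coder/variable combination
  pivot_longer(-c(id, coder), names_to = "variable", values_to = "value") %>%
  unite(column, coder, variable) %>%
  # back to wide: one column per coder/variable pair
  pivot_wider(names_from = column, values_from = value)
# keep only the documents that were coded twice (no NA in any column)
new_df %>%
  filter(if_all(everything(), ~ !is.na(.x)))
Once the data are in this shape, a pair of columns such as coder_1_var1 and coder_2_var1 can be passed to an inter-rater reliability function, for example irr::kappa2(), which expects an n x 2 table of ratings (assuming the irr package is installed).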


Assigning an index value when there are repeated values in R [duplicate]

This question already has answers here: Numbering rows within groups in a data frame (10 answers). Closed 5 months ago.
I need to assign an index value when a value is repeated.
Here is a sample dataset.
df <- data.frame(id = c("A","A","B","C","D","D","D"))
> df
id
1 A
2 A
3 B
4 C
5 D
6 D
7 D
How can I get that indexing column as below:
> df1
id index
1 A 1
2 A 2
3 B 1
4 C 1
5 D 1
6 D 2
7 D 3
base R:
df$index <- ave(rep(1L, nrow(df)), df$id, FUN = seq_along)  # running count within each id
df
# id index
# 1 A 1
# 2 A 2
# 3 B 1
# 4 C 1
# 5 D 1
# 6 D 2
# 7 D 3
Another option is to use n(), like this:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(index = 1:n()) %>%
  ungroup()
#> # A tibble: 7 × 2
#> id index
#> <chr> <int>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 C 1
#> 5 D 1
#> 6 D 2
#> 7 D 3
Created on 2022-09-23 with reprex v2.0.2
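If the data.table package is available, its rowid() helper gives the same within-group running count in a single call; a minimal sketch (not part of the original answers):
library(data.table)
df <- data.frame(id = c("A", "A", "B", "C", "D", "D", "D"))
df$index <- rowid(df$id)  # running count within each value of id
df
#   id index
# 1  A     1
# 2  A     2
# 3  B     1
# 4  C     1
# 5  D     1
# 6  D     2
# 7  D     3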

Create a dataframe with all observations unique for one specific column of a dataframe in R

I have a dataframe that I would like to reduce in size by keeping only unique observations. However, I want uniqueness to be judged on a single column while preserving the rest of the dataframe. Because other columns also contain repeated values, I cannot simply pass the entire dataframe to the unique function. How can I do this and still retrieve the entire dataframe?
For example, with the following dataframe, I would like to only reduce the dataframe by unique observations of variable a (column 1):
a b c d e
1 2 3 4 5
1 2 3 4 6
3 4 5 6 8
4 5 2 3 6
Therefore, I only remove row 2, because "1" is repeated. The other rows/columns repeat values, but these observations are maintained, because I only assess the uniqueness of column 1 (a).
Desired outcome:
a b c d e
1 2 3 4 5
3 4 5 6 8
4 5 2 3 6
How can I process this and then retrieve the entire dataframe? Is there a configuration for the unique function to do this, or do I need an alternative?
base R
dat[!duplicated(dat$a),]
# a b c d e
# 1 1 2 3 4 5
# 3 3 4 5 6 8
# 4 4 5 2 3 6
dplyr
dplyr::distinct(dat, a, .keep_all = TRUE)
# a b c d e
# 1 1 2 3 4 5
# 2 3 4 5 6 8
# 3 4 5 2 3 6
Another option: per-group, pick a particular value from the duplicated rows.
library(dplyr)
dat %>%
  group_by(a) %>%
  slice(which.max(e)) %>%
  ungroup()
# # A tibble: 3 x 5
# a b c d e
# <int> <int> <int> <int> <int>
# 1 1 2 3 4 6
# 2 3 4 5 6 8
# 3 4 5 2 3 6
library(data.table)
as.data.table(dat)[, .SD[which.max(e),], by = .(a) ]
# a b c d e
# <int> <int> <int> <int> <int>
# 1: 1 2 3 4 6
# 2: 3 4 5 6 8
# 3: 4 5 2 3 6
As for unique, it does have an incomparables argument, but it is not yet implemented:
unique(dat, incomparables = c("b", "c", "d", "e"))
# Error: argument 'incomparables != FALSE' is not used (yet)
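Relatedly, assuming the data.table package is available, its unique() method does accept a by argument, so deduplication on column a alone can be written directly; a sketch, with dat rebuilt from the table above:
library(data.table)
dat <- data.frame(a = c(1, 1, 3, 4),
                  b = c(2, 2, 4, 5),
                  c = c(3, 3, 5, 2),
                  d = c(4, 4, 6, 3),
                  e = c(5, 6, 8, 6))
unique(as.data.table(dat), by = "a")  # keeps the first row per value of a
#    a b c d e
# 1: 1 2 3 4 5
# 2: 3 4 5 6 8
# 3: 4 5 2 3 6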

How to conditionally update a R tibble using multiple conditions of another tibble

I have two tables. I would like to update the first table using a second table using multiple conditions. In base R I would use if...else type constructs to do this but would like to know how to achieve this using dplyr.
The table to be updated (i.e., to have a field added) looks like this:
> Intvs
# A tibble: 12 x 3
Group From To
<chr> <dbl> <dbl>
1 A 0 1
2 A 1 2
3 A 2 3
4 A 3 4
5 A 4 5
6 A 5 6
7 B 0 1
8 B 1 2
9 B 2 3
10 B 3 4
11 B 4 5
12 B 5 6
The tibble that I would like to use to make the update looks like this:
>Zns
# A tibble: 2 x 4
Group From To Zone
<chr> <chr> <dbl> <dbl>
1 A X 1 5
2 B Y 3 4
I would like to update the Intvs tibble with the Zns tibble using the fields == Group, >= From, and <= To to control the update. The expected output should look like this
> Intvs
# A tibble: 12 x 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
What is the most efficient way to do this using dplyr?
The code below should make the dummy tables Intvs and Zns:
# load packages
require(tidyverse)
# Intervals table
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
names(Zns) <- c("Group", "From", "To", "Zone")
Using a non-equi join from data.table:
library(data.table)
setDT(Intvs)[Zns, Zone := Zone, on = .(Group, From >= From, To <= To)]
Output:
> Intvs
Group From To Zone
<char> <num> <num> <char>
1: A 0 1 <NA>
2: A 1 2 X
3: A 2 3 X
4: A 3 4 X
5: A 4 5 X
6: A 5 6 <NA>
7: B 0 1 <NA>
8: B 1 2 <NA>
9: B 2 3 <NA>
10: B 3 4 Y
11: B 4 5 <NA>
12: B 5 6 <NA>
This is the closest I get. It is not giving the expected output:
library(dplyr)
left_join(Intvs, Zns, by = "Group") %>%
  group_by(Group) %>%
  mutate(Zone1 = case_when(From.x <= Zone & From.x >= To.y ~ From.y)) %>%
  select(Group, From = From.x, To = To.x, Zone = Zone1)
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 X
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 Y
12 B 5 6 NA
Not sure why your first row does not give NA, since 0 - 1 is not in the range of 1 - 5.
First, left_join the two dataframes using the Group column. Here I assign the suffix "_Zns" to columns coming from the Zns dataframe. Then use a single case_when (or ifelse) statement to assign NA to rows that do not fit the range. Finally, drop the columns that end with _Zns.
library(dplyr)
left_join(Intvs, Zns, by = "Group", suffix = c("", "_Zns")) %>%
  mutate(Zone = case_when(From >= From_Zns & To <= To_Zns ~ Zone,
                          TRUE ~ NA_character_)) %>%
  select(-ends_with("Zns"))
# A tibble: 12 × 4
Group From To Zone
<chr> <dbl> <dbl> <chr>
1 A 0 1 NA
2 A 1 2 X
3 A 2 3 X
4 A 3 4 X
5 A 4 5 X
6 A 5 6 NA
7 B 0 1 NA
8 B 1 2 NA
9 B 2 3 NA
10 B 3 4 Y
11 B 4 5 NA
12 B 5 6 NA
Data
Note that I have changed your column name order in the Zns dataframe.
a <- c(rep("A", 6), rep("B", 6))
b <- c(seq(0,5,1), seq(0,5,1) )
c <- c(seq(1,6,1), seq(1,6,1))
Intvs <- bind_cols(a, b, c)
names(Intvs) <- c("Group", "From", "To")
# Zones table
a <- c("A", "B")
b <- c("X", "Y")
c <- c(1, 3)
d <- c(5, 4)
Zns <- bind_cols(a, b, c, d)
colnames(Zns) <- c("Group", "Zone", "From", "To")
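In more recent dplyr (>= 1.1.0), left_join() also supports non-equi joins through join_by(), which avoids both data.table and the manual case_when() step. A sketch using the re-ordered Zns (Group, Zone, From, To) defined just above:
library(dplyr)
Intvs %>%
  left_join(Zns, by = join_by(Group, From >= From, To <= To),
            suffix = c("", "_Zns")) %>%
  # the inequality key columns from Zns come back as From_Zns / To_Zns
  select(Group, From, To, Zone)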

Split information from two columns, R, tidyverse

I've got some data in two columns:
# A tibble: 16 x 2
code niveau
<chr> <dbl>
1 A 1
2 1 2
3 2 2
4 3 2
5 4 2
6 5 2
7 B 1
8 6 2
9 7 2
My desired output is:
A tibble: 16 x 3
code niveau cat
<chr> <dbl> <chr>
1 A 1 A
2 1 2 A
3 2 2 A
4 3 2 A
5 4 2 A
6 5 2 A
7 B 1 B
8 6 2 B
Is there a tidy way to convert these data without looping through them?
Here is some dummy data:
data<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2))
desired_output<-tibble(code=c('A', 1,2,3,4,5,'B', 6,7,8,9,'C',10,11,12,13), niveau=c(1, 2,2,2,2,2,1,2,2,2,2,1,2,2,2,2),
cat=c(rep('A', 6),rep('B', 5), rep('C', 5)))
Nicolas
You can create a new column cat by replacing code values with NA wherever code is a number, and then use fill to replace each missing value with the previous non-NA value.
library(dplyr)
data %>% mutate(cat = replace(code, grepl('\\d', code), NA)) %>% tidyr::fill(cat)
# A tibble: 16 x 3
# code niveau cat
# <chr> <dbl> <chr>
# 1 A 1 A
# 2 1 2 A
# 3 2 2 A
# 4 3 2 A
# 5 4 2 A
# 6 5 2 A
# 7 B 1 B
# 8 6 2 B
# 9 7 2 B
#10 8 2 B
#11 9 2 B
#12 C 1 C
#13 10 2 C
#14 11 2 C
#15 12 2 C
#16 13 2 C
We can use str_detect from stringr
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(cat = replace(code, str_detect(code, '\\d'), NA)) %>%
  fill(cat)
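If the letter rows are exactly the rows where niveau == 1 (as in the dummy data), the cat column can also be built without any pattern matching, by indexing the letters with a running group counter; a sketch:
library(dplyr)
data %>%
  # code[niveau == 1] is c("A", "B", "C"); cumsum() repeats each letter for its block
  mutate(cat = code[niveau == 1][cumsum(niveau == 1)])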

Summarise all using which on other column in dplyr [duplicate]

This question already has answers here: Extract row corresponding to minimum value of a variable by group (9 answers). Closed 5 years ago.
For some reason, I could not find a solution using the summarise_all function for the following problem:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,10,9))
desired results:
df %>%
  group_by(A) %>%
  summarise(B = B[which.min(D)],
            C = C[which.min(D)],
            D = D[which.min(D)])
# A tibble: 4 x 4
A B C D
<dbl> <int> <int> <dbl>
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
What I tried:
df %>%
  group_by(A) %>%
  summarise_all(.[which.min(D)])
In words, I want to group by a variable and, for each column, find the value that corresponds to the minimum of another column. I could not find a solution for this using summarise_all. I am looking for a dplyr approach.
You can just filter down to the row that has a minimum value of D for each level of A. The code below assumes there is only one minimum row in each group.
df %>%
  group_by(A) %>%
  arrange(D) %>%
  slice(1)
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
If there can be multiple rows with minimum D, then:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,9,9))
df %>%
  group_by(A) %>%
  filter(D == min(D))
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 7 2 9
5 4 8 1 9
You need filter - any time you're trying to drop some rows and keep others, that's the verb you want.
df %>% group_by(A) %>% filter(D == min(D))
#> # A tibble: 4 x 4
#> # Groups: A [4]
#> A B C D
#> <dbl> <int> <int> <dbl>
#> 1 1 1 8 1
#> 2 2 2 7 2
#> 3 3 4 5 1
#> 4 4 8 1 9
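In more recent dplyr (>= 1.0.0), slice_min() expresses "the row(s) with the minimum D per group" directly; a sketch, where with_ties controls whether tied minima are kept:
library(dplyr)
df %>%
  group_by(A) %>%
  slice_min(D, n = 1, with_ties = FALSE) %>%  # one row per group even when D is tied
  ungroup()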
