Assigning an index value when there are repeated values in R [duplicate]

I need to assign an index value when a value is repeated.
Here is a sample dataset.
df <- data.frame(id = c("A","A","B","C","D","D","D"))
> df
id
1 A
2 A
3 B
4 C
5 D
6 D
7 D
How can I get that indexing column as below:
> df1
id index
1 A 1
2 A 2
3 B 1
4 C 1
5 D 1
6 D 2
7 D 3

base R:
df$index <- ave(rep(1L, nrow(df)), df$id, FUN = seq_along)
df
# id index
# 1 A 1
# 2 A 2
# 3 B 1
# 4 C 1
# 5 D 1
# 6 D 2
# 7 D 3
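To see why this works: ave() splits its first argument by the grouping vector, applies FUN to each piece, and writes the results back into the original row positions, so the ids do not have to be sorted. A minimal sketch with made-up, unsorted ids (df2 is a hypothetical example, not the asker's data):

```r
# Hypothetical data: ids are deliberately unsorted and non-contiguous.
df2 <- data.frame(id = c("B", "A", "B", "C", "A", "B"))

# seq_along() numbers the rows inside each id group; ave() puts each
# number back at the row it came from.
df2$index <- ave(seq_along(df2$id), df2$id, FUN = seq_along)
df2
# index column: 1 1 2 1 2 3
```

Counting over seq_along(df2$id) instead of rep(1L, nrow(df2)) is only a matter of taste; either gives ave() an integer vector of the right length to split.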

Another option is dplyr, using n() within each group:
library(dplyr)
df %>%
group_by(id) %>%
mutate(index = 1:n()) %>%
ungroup()
#> # A tibble: 7 × 2
#> id index
#> <chr> <int>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 C 1
#> 5 D 1
#> 6 D 2
#> 7 D 3
Created on 2022-09-23 with reprex v2.0.2
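A possible variant of the same idea, assuming a reasonably recent dplyr: row_number() called without arguments returns the position of each row within its group, so the 1:n() sequence does not have to be built by hand. The name out is just a placeholder for the result.

```r
library(dplyr)

df <- data.frame(id = c("A", "A", "B", "C", "D", "D", "D"))

out <- df %>%
  group_by(id) %>%
  mutate(index = row_number()) %>%  # position of each row within its id group
  ungroup()
out
```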

Related

Remove groups if all NA

Let's say I have a table like so:
df <- data.frame("Group" = c("A","A","A","B","B","B","C","C","C"),
"Num" = c(1,2,3,1,2,NA,NA,NA,NA))
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
7 C NA
8 C NA
9 C NA
In this case, because group C has Num as NA for all entries, I would like to remove rows in group C from the table. Any help is appreciated!
You could group_by your Group column and filter out the groups in which every value of Num is NA. You can use the following code:
library(dplyr)
df %>%
group_by(Group) %>%
filter(!all(is.na(Num)))
#> # A tibble: 6 × 2
#> # Groups: Group [2]
#> Group Num
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B NA
Created on 2023-01-18 with reprex v2.0.2
In base R you could index based on all the groups that have at least one non-NA value:
idx <- df$Group %in% unique(df[!is.na(df$Num),"Group"])
idx
df[idx,]
# or in one line
df[df$Group %in% unique(df[!is.na(df$Num),"Group"]),]
output
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
Using ave.
df[with(df, !ave(Num, Group, FUN=\(x) all(is.na(x)))), ]
# Group Num
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B NA
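If the ave() one-liner is hard to read, the same logic can be spelled out in two steps. This is only a sketch of one alternative: tapply() computes one TRUE/FALSE flag per group, and the flags are then mapped back to rows with %in%. The names all_na, keep and result are placeholders introduced here.

```r
df <- data.frame(
  Group = rep(c("A", "B", "C"), each = 3),
  Num   = c(1, 2, 3, 1, 2, NA, NA, NA, NA)
)

# One flag per group: TRUE when every Num in that group is NA.
all_na <- tapply(df$Num, df$Group, function(x) all(is.na(x)))

# Names of the groups that have at least one non-NA value.
keep <- names(all_na)[!all_na]

result <- df[df$Group %in% keep, ]
result
```

Group B is kept even though it contains one NA, because only groups that are entirely NA are dropped.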

How do I remove rows with value frequencies of more than x in R? [duplicate]

Suppose I have a data set:
x y
1 a
1 a
1 a
1 a
2 a
2 a
2 b
3 c
3 e
How do I delete the rows whose x value appears more than 3 times (e.g. '1', which appears 4 times)?
base R
dat[ave(dat$x, dat$x, FUN=length) < 4,]
# x y
# 5 2 a
# 6 2 a
# 7 2 b
# 8 3 c
# 9 3 e
dplyr
library(dplyr)
dat %>%
group_by(x) %>%
filter(n() < 4) %>%
ungroup()
# # A tibble: 5 x 2
# x y
# <int> <chr>
# 1 2 a
# 2 2 a
# 3 2 b
# 4 3 c
# 5 3 e
data.table
library(data.table)
as.data.table(dat)[, .SD[.N < 4,], by = .(x)][]
# x y
# <int> <char>
# 1: 2 a
# 2: 2 a
# 3: 2 b
# 4: 3 c
# 5: 3 e
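The question does not show how dat was built, so the sketch below reconstructs it as an assumption. It also stores x as character on purpose: ave() returns a vector of the same type as its first argument, so ave(dat$x, dat$x, FUN = length) would hand the counts back as character in that case. Counting over the integer row positions is one way to sidestep that.

```r
# Assumed reconstruction of the asker's data; x is stored as character here
# to show that the count does not depend on the type of x.
dat <- data.frame(
  x = as.character(c(1, 1, 1, 1, 2, 2, 2, 3, 3)),
  y = c("a", "a", "a", "a", "a", "a", "b", "c", "e")
)

# Count rows per value of x; counting over integer row positions keeps the
# result integer even when x itself is character or factor.
n_per_row <- ave(seq_along(dat$x), dat$x, FUN = length)

result <- dat[n_per_row < 4, ]
result
```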

data.table: Select row with maximum value by group with several grouping variables [duplicate]

I am trying to subset the maximum (minimum / whatever) value by groups. They are defined by more than one grouping variable.
My working workaround is to unite the grouping columns first (see desired output), but is there a more direct data.table syntax?
This is not a direct duplicate of the well-known questions:
https://stackoverflow.com/a/24558696/7941188 - because asking for grouping by one variable.
How to select the rows with maximum values in each group with dplyr? - because only dplyr solutions offered.
Cheers
library(tidyverse)
library(data.table)
set.seed(1)
mydf <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
mydf$value <- runif(nrow(mydf))
mydf %>%
group_by(A, B) %>%
filter(value == max(value)) %>%
arrange(A, B, C)
#> # A tibble: 25 x 4
#> # Groups: A, B [25]
#> A B C value
#> <int> <int> <int> <dbl>
#> 1 1 1 4 0.892
#> 2 1 2 1 0.898
#> 3 1 3 5 0.976
#> 4 1 4 2 0.821
#> 5 1 5 5 0.992
#> 6 2 1 4 0.864
#> 7 2 2 1 0.945
#> 8 2 3 2 0.794
#> 9 2 4 1 0.718
#> 10 2 5 3 0.839
#> # … with 15 more rows
Desired output - is there a way to get that without creating the united column first?
mydt <- mydf %>%
arrange(A,B,C) %>%
unite("A_B", A, B) %>%
as.data.table()
mydt[mydt[, .I[value == max(value)], by = A_B]$V1] %>%
separate(A_B, LETTERS[1:2]) %>%
head(10)
#> A B C value
#> 1: 1 1 4 0.8921983
#> 2: 1 2 1 0.8983897
#> 3: 1 3 5 0.9761707
#> 4: 1 4 2 0.8209463
#> 5: 1 5 5 0.9918386
#> 6: 2 1 4 0.8643395
#> 7: 2 2 1 0.9446753
#> 8: 2 3 2 0.7942399
#> 9: 2 4 1 0.7176185
#> 10: 2 5 3 0.8394404
Created on 2020-04-21 by the reprex package (v0.3.0)
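You can group by both columns directly with by = .(A, B). Collect the row numbers (.I) of the rows where value equals the group maximum, then use them to subset the data.table. A logical vector taken from the grouped result would come back in group order rather than in the original row order, so it is only safe for subsetting when the data are already sorted by A and B.
library(data.table)
setDT(mydf)
mydf[mydf[, .I[value == max(value)], by = .(A, B)]$V1]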
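A small sketch of indexing with .I when grouping by several columns. The table dt is hypothetical and deliberately not sorted by group, because a grouped data.table result is returned group by group and not in the original row order; row numbers stay valid either way.

```r
library(data.table)

# Hypothetical data whose rows are not sorted by group.
dt <- data.table(
  A     = c(1, 2, 1, 2),
  B     = c(1, 1, 1, 1),
  value = c(0.1, 0.2, 0.7, 0.9)
)

# .I holds the original row numbers of the current group, so this returns
# the row number of each group's maximum.
idx <- dt[, .I[value == max(value)], by = .(A, B)]$V1

res <- dt[idx]
res
```

Here idx is c(3, 4): row 3 is the maximum of group (A = 1, B = 1) and row 4 the maximum of group (A = 2, B = 1).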

Restructuring data (for IRR-analysis)

I have the following data-frame df (fictitious data) with several variables var1, var2, ..., var_n:
var1<-c("A","A","A","B","A","C","C","A", "A", "E", "E", "B")
var2<-c(NA,"1","1","5","6","2","3","1", "1", "3", "3", "2")
id<-c(1,2,2,3,3,4,4,5,5,6,6,7)
df<-data.frame(id, var1, var2)
df
id var1 var2
1 A <NA>
2 A 1
2 A 1
3 B 5
3 A 6
4 C 2
4 C 3
5 A 1
5 A 1
6 E 3
6 E 3
7 B 2
The data come from a document analysis in which several coders extracted values from physical files. Each file has a unique id, so two entries with the same id mean that two different coders coded the same document. For example, in document no. 4 both coders agreed that var1 has the value C, whereas in document no. 3 they disagreed (B vs. A).
In order to calculate inter-rater-reliability (irr) I need to restructure the dataframe as follows:
id var1 var1_coder2 var2 var2_coder2
2 A A 1 1
3 B A 5 6
4 C C 2 3
5 A A 1 1
6 E E 3 3
Can anyone tell me how to get this done? Thanks!
You can transform your data with functions from dplyr (group_by, mutate) and tidyr (gather, spread, unite):
library(tidyr)
library(dplyr)
new_df <- df %>%
group_by(id) %>%
mutate(coder = paste0("coder_", 1:n())) %>%
gather("variables", "values", -id, -coder) %>%
unite(column, coder, variables) %>%
spread(column, values)
new_df
# A tibble: 7 x 5
# Groups: id [7]
# id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
# <dbl> <chr> <chr> <chr> <chr>
# 1 1 A NA NA NA
# 2 2 A 1 A 1
# 3 3 B 5 A 6
# 4 4 C 2 C 3
# 5 5 A 1 A 1
# 6 6 E 3 E 3
# 7 7 B 2 NA NA
If you only want to keep the rows where all coders have entered values, you can use filter_all:
new_df %>%
filter_all(all_vars(!is.na(.)))
# A tibble: 5 x 5
# Groups: id [5]
# id coder_1_var1 coder_1_var2 coder_2_var1 coder_2_var2
# <dbl> <chr> <chr> <chr> <chr>
# 1 2 A 1 A 1
# 2 3 B 5 A 6
# 3 4 C 2 C 3
# 4 5 A 1 A 1
# 5 6 E 3 E 3
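For reference, the same reshaping can probably be written with pivot_wider() from tidyr (version 1.0.0 or later), which accepts several value columns at once. This is a sketch, not the answerer's code: the coder labels are assigned purely by row order within each id, which is an assumption carried over from the approach above, and wide is a placeholder name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id   = c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7),
  var1 = c("A", "A", "A", "B", "A", "C", "C", "A", "A", "E", "E", "B"),
  var2 = c(NA, "1", "1", "5", "6", "2", "3", "1", "1", "3", "3", "2")
)

wide <- df %>%
  group_by(id) %>%
  mutate(coder = paste0("coder", row_number())) %>%  # coder1, coder2 within each id
  ungroup() %>%
  pivot_wider(names_from = coder, values_from = c(var1, var2)) %>%
  drop_na()  # keep only documents with complete values from both coders
wide
```

The resulting columns are named var1_coder1, var1_coder2, var2_coder1 and var2_coder2, which is close to the layout asked for in the question.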

Summarise all using which on other column in dplyr [duplicate]

For some reason, I could not find a solution using the summarise_all function for the following problem:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,10,9))
desired results:
df %>%
group_by(A) %>%
summarise(B = B[which.min(D)],
C = C[which.min(D)],
D = D[which.min(D)])
# A tibble: 4 x 4
A B C D
<dbl> <int> <int> <dbl>
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
What I tried:
df %>%
group_by(A) %>%
summarise_all(.[which.min(D)])
In words, I want to group by a variable and find for each column the value that belongs to the minimum value of another column. I could not find a solution for this using summarise_all. I am searching for a dplyr approach.
You can just filter down to the row that has a minimum value of D for each level of A. The code below assumes there is only one minimum row in each group.
df %>%
group_by(A) %>%
arrange(D) %>%
slice(1)
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
If there can be multiple rows with minimum D, then:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,9,9))
df %>%
group_by(A) %>%
filter(D == min(D))
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 7 2 9
5 4 8 1 9
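Both behaviours can also be had from a single verb, assuming dplyr 1.0.0 or later: slice_min() keeps ties by default and drops them with with_ties = FALSE. A sketch on the tied data from above (one_per_group and all_ties are placeholder names):

```r
library(dplyr)

df <- data.frame(A = c(1, 2, 2, 3, 3, 3, 4, 4), B = 1:8, C = 8:1,
                 D = c(1, 2, 3, 1, 2, 5, 9, 9))

# Exactly one row per group, even when the minimum of D is tied.
one_per_group <- df %>%
  group_by(A) %>%
  slice_min(D, n = 1, with_ties = FALSE) %>%
  ungroup()

# Every row that attains the group minimum (ties are kept by default).
all_ties <- df %>%
  group_by(A) %>%
  slice_min(D, n = 1) %>%
  ungroup()
```

With this data, one_per_group has 4 rows and all_ties has 5, because group 4 has two rows with D equal to 9.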
You need filter - any time you're trying to drop some rows and keep others, that's the verb you want.
df %>% group_by(A) %>% filter(D == min(D))
#> # A tibble: 4 x 4
#> # Groups: A [4]
#> A B C D
#> <dbl> <int> <int> <dbl>
#> 1 1 1 8 1
#> 2 2 2 7 2
#> 3 3 4 5 1
#> 4 4 8 1 9
