How to get ID numbers directly from pivot_wider in R

I have a dataframe like this:
library(tibble)
df <- tribble(~First, ~Last, ~Reviewer, ~Assessment, ~Amount,
"a", "b", "c", "Yes", 10,
"a", "b", "d", "No", 8,
"e", "f", "c", "No", 7,
"e", "f", "e", "Yes", 6)
df
#> # A tibble: 4 × 5
#> First Last Reviewer Assessment Amount
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 a b c Yes 10
#> 2 a b d No 8
#> 3 e f c No 7
#> 4 e f e Yes 6
I want to use pivot_wider to convert df to a dataframe like this:
tribble(~First, ~Last, ~Reviewer_1, ~Assessment_1, ~Amount_1, ~Reviewer_2, ~Assessment_2, ~Amount_2,
"a", "b", "c", "Yes", 10, "d", "No", 8,
"e", "f", "c", "No", 7, "e", "Yes", 6)
#> # A tibble: 2 × 8
#> First Last Reviewer_1 Assessment_1 Amount_1 Reviewer_2 Assessment_2 Amount_2
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 a b c Yes 10 d No 8
#> 2 e f c No 7 e Yes 6
Is there a way to do this with the pivot_wider function? Note that the reviewer ID numbers in the second table are not included in the first table.

library(dplyr)
library(tidyr)
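df %>%
  group_by(First, Last) %>%
  # number the reviews within each First/Last pair
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    # id_cols is named explicitly; recent tidyr versions warn when it is passed by position
    id_cols = c(First, Last), names_from = rn,
    values_from = c(Reviewer, Assessment, Amount))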
# # A tibble: 2 × 8
# First Last Reviewer_1 Reviewer_2 Assessment_1 Assessment_2 Amount_1 Amount_2
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 a b c d Yes No 10 8
# 2 e f c e No Yes 7 6
(order of columns notwithstanding)
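If the exact column order from the question matters, one option is the names_vary argument of pivot_wider. The sketch below assumes tidyr 1.2.0 or later (where names_vary is available) and reuses the df from the question; it is a variation on the answer above, not part of it.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(~First, ~Last, ~Reviewer, ~Assessment, ~Amount,
              "a", "b", "c", "Yes", 10,
              "a", "b", "d", "No", 8,
              "e", "f", "c", "No", 7,
              "e", "f", "e", "Yes", 6)

wide <- df %>%
  group_by(First, Last) %>%
  mutate(rn = row_number()) %>%   # review number within each person
  ungroup() %>%
  pivot_wider(
    id_cols = c(First, Last),
    names_from = rn,
    values_from = c(Reviewer, Assessment, Amount),
    names_vary = "slowest"        # keep all *_1 columns together, then all *_2
  )

names(wide)
```

With names_vary = "slowest" the columns for review 1 come first, followed by the columns for review 2, which matches the layout requested in the question.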

Here is how you can do it:
df %>%
group_by(First, Last) %>%
mutate(Review_no = rank(Reviewer)) %>%
pivot_wider(names_from = Review_no,
values_from = c(Reviewer, Assessment, Amount))
output:
# A tibble: 2 x 8
# Groups: First, Last [2]
First Last Reviewer_1 Reviewer_2 Assessment_1 Assessment_2 Amount_1 Amount_2
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 a b c d Yes No 10 8
2 e f c e No Yes 7 6

Related

Handling duplicated entries

I would like to reassign a given record to a single group if the record is duplicated. In the dataset below, I would like all 12-4 records to be assigned to group A or to group B, but not both. Is there a way to go about it?
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8")
)
# Attempts to tease out records for each group
dat %>% pivot_wider(names_from = group, values_from = assigned)
You can group by record and reassign all to the same group, chosen at random from the available groups:
dat %>%
group_by(assigned) %>%
mutate(group = nth(group, sample(n())[1])) %>%
ungroup()
#> # A tibble: 9 x 2
#> group assigned
#> <chr> <chr>
#> 1 A 12-1
#> 2 A 12-2
#> 3 A 12-3
#> 4 A 12-4
#> 5 A 12-4
#> 6 B 12-5
#> 7 B 12-6
#> 8 B 12-7
#> 9 B 12-8
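If a reproducible result is preferred over a random pick, one option is to always keep the first group seen for each record. This is a minimal sketch of that variation; choosing first() is an assumption about which group should win, not something the question specifies.

```r
library(dplyr)
library(tibble)

dat <- tibble(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
               "12-7", "12-8")
)

deduped <- dat %>%
  group_by(assigned) %>%
  mutate(group = first(group)) %>%  # every copy of a record takes the first group listed
  ungroup()

deduped
```

Here both 12-4 rows end up in group A because A appears first for that record.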
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c(
"12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8"
)
)
dat %>%
select(-group) %>%
left_join(
dat %>%
left_join(dat %>% count(group)) %>%
# reassign to the smallest group
arrange(n) %>%
select(-n) %>%
distinct(assigned, .keep_all = TRUE)
)
#> Joining, by = "group"
#> Joining, by = "assigned"
#> # A tibble: 9 × 2
#> assigned group
#> <chr> <chr>
#> 1 12-1 A
#> 2 12-2 A
#> 3 12-3 A
#> 4 12-4 A
#> 5 12-4 A
#> 6 12-5 B
#> 7 12-6 B
#> 8 12-7 B
#> 9 12-8 B
Created on 2022-04-04 by the reprex package (v2.0.0)

How to find last column with value (for each row), with some rows with all NA as values?

I have the same problem as in "How to find last column with value (for each row) in R?", except that some of my rows have no values at all (the entire row is NA). The sample data in that post did not include a row consisting only of NAs.
I was wondering how I should modify the following? I do not want to remove those rows with all NAs because they will be useful in later analysis.
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
tail(na.omit(tmp), 1)}) %>%
ungroup
Thanks a lot in advance!
If all the elements in a row are missing, a general solution is to add a condition that returns NA for those rows:
library(dplyr)
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
if(all(is.na(tmp))) NA_character_ else
tail(na.omit(tmp), 1)}) %>%
ungroup
-output
# A tibble: 4 × 5
id m_1 m_2 m_3 m
<dbl> <chr> <chr> <chr> <chr>
1 1 a e i i
2 2 b <NA> <NA> b
3 3 <NA> <NA> <NA> <NA>
4 4 d h l l
If the OP wants to return only the last non-NA element, we may also add an index [1] to extract it, which automatically returns NA when there are no elements:
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
tail(na.omit(tmp), 1)[1]}) %>%
ungroup
# A tibble: 4 × 5
id m_1 m_2 m_3 m
<dbl> <chr> <chr> <chr> <chr>
1 1 a e i i
2 2 b <NA> <NA> b
3 3 <NA> <NA> <NA> <NA>
4 4 d h l l
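The [1] trick works because indexing past the end of a vector in R returns NA of the vector's type. A small base R illustration, using a made-up all-NA vector x to stand in for one row:

```r
# x stands in for a row whose m_* values are all missing (illustrative data)
x <- c(NA_character_, NA_character_, NA_character_)

last_non_na <- tail(na.omit(x), 1)  # nothing survives na.omit()
length(last_non_na)                 # zero-length character vector

last_non_na[1]                      # indexing an empty vector yields NA
```

Without the [1], mutate would receive a zero-length result for that row and fail, since it expects exactly one value per row.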
data
df <- structure(list(id = c(1, 2, 3, 4), m_1 = c("a", "b", NA, "d"),
m_2 = c("e", NA, NA, "h"), m_3 = c("i", NA, NA, "l")), row.names = c(NA,
-4L), class = "data.frame")
Using the data from @akrun (many thanks), we could also do it this way, which returns the name of the last non-NA column rather than its value.
'\\b[^,]+$' is a regular expression:
\\ ... In an R string literal the backslash itself has to be escaped, so \\b in R code is the regex \b. In most other languages a single \ is enough.
\b ... The metacharacter \b is an anchor, like ^ and $. It matches at a position called a "word boundary". This match is zero-length.
[^,]+ ... A negated character class: with the ^ caret inside the brackets it matches one character that is not a comma. The + means one or more such characters.
$ ... means end of string, or end of line, depending on multiline mode.
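To see what the pattern extracts on its own, here is a small sketch on made-up strings of the kind that unite() produces in the code below (the vector joined is hypothetical):

```r
library(stringr)

# comma-separated column names, as produced by unite(..., sep = ', ');
# the empty string corresponds to a row where every column was NA
joined <- c("m_1, m_2, m_3", "m_1", "")

last_name <- str_extract(joined, "\\b[^,]+$")
last_name
```

The pattern keeps only the text after the last comma, without the leading space, and returns NA when there is nothing to match.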
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("m"), ~case_when(!is.na(.) ~ cur_column()), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ', ') %>%
mutate(New_Col = str_extract(New_Col, '\\b[^,]+$'))
id m_1 m_2 m_3 New_Col
1 1 a e i m_3
2 2 b <NA> <NA> m_1
3 3 <NA> <NA> <NA> <NA>
4 4 d h l m_3
library(tidyverse)
df <- data.frame(id = c(1, 2, 3, 4), m_1 = c("a", NA, "c", "d"), m_2 = c("e", NA, "g", "h"), m_3 = c("i", NA, NA, "l"))
df %>%
rowwise() %>%
mutate(
nms = list(str_subset(names(df), "^m")),
m = c_across(starts_with("m")) %>%
{
ifelse(test = all(is.na(.)),
yes = NA,
no = nms[which(. == tail(na.omit(.), 1))]
)
}
) %>%
select(-nms)
#> # A tibble: 4 × 5
#> # Rowwise:
#> id m_1 m_2 m_3 m
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 a e i m_3
#> 2 2 <NA> <NA> <NA> <NA>
#> 3 3 c g <NA> m_2
#> 4 4 d h l m_3
# only the value, not the column name
df %>%
rowwise() %>%
mutate(
m = c_across(starts_with("m")) %>%
{
ifelse(test = all(is.na(.)),
yes = NA,
no = tail(na.omit(.), 1)
)
}
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> id m_1 m_2 m_3 m
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 a e i i
#> 2 2 <NA> <NA> <NA> <NA>
#> 3 3 c g <NA> g
#> 4 4 d h l l
Created on 2022-01-01 by the reprex package (v2.0.1)

Slice out sequence of grouped rows [duplicate]

This question already has answers here:
Getting the top values by group
(6 answers)
Closed 1 year ago.
I have this data:
df <- data.frame(
node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)
I want to count grouped frequencies and proportions and slice/filter out a sequence of grouped rows. Say, given the small dataset, I want to have the rows with the two highest Freq_left values per group. How can that be done? I can only extract the rows with the maximum Freq_left values but not the desired sequence of rows:
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(Freq_left)
# A tibble: 2 × 4
# Groups: node [2]
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
1 A ab 4 30.8
2 B xx 3 23.1
Expected output:
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
A ab 4 30.8
A cc 2 15.4
B xx 3 23.1
B zz 2 15.4
You could use dplyr::slice_max with n = 2.
Thanks to @PaulSmith for pointing out that dplyr::top_n is superseded in favor of dplyr::slice_max:
library(dplyr)
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(order_by = Prop_left, n = 2)
#> `summarise()` has grouped output by 'node'. You can override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: node [2]
#> node left Freq_left Prop_left
#> <chr> <chr> <int> <dbl>
#> 1 A ab 4 30.8
#> 2 A cc 2 15.4
#> 3 B xx 3 23.1
#> 4 B zz 2 15.4
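One thing to keep in mind is that slice_max keeps ties by default, so a group can return more than n rows. The sketch below is a variation that uses count() instead of summarise() and sets with_ties = FALSE to cap each node at exactly two rows; it assumes the same df as in the question.

```r
library(dplyr)

df <- data.frame(
  node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
  left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)

top2 <- df %>%
  count(node, left, name = "Freq_left") %>%                       # frequency per node/left pair
  mutate(Prop_left = round(Freq_left / sum(Freq_left) * 100, 4)) %>%  # share of all rows
  group_by(node) %>%
  slice_max(order_by = Freq_left, n = 2, with_ties = FALSE) %>%   # at most two rows per node
  ungroup()

top2
```

With with_ties = FALSE, tied rows are cut off in the order they appear, so which of the tied rows is kept depends on row order.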

Replacing missing values by group and identifying mutual exclusiveness

I am working with the grouped data in R.
In the following example data, I would like to fill the missing values in the "sex" variable, leaving them as they are when there is no corresponding data (i.e. for id=6).
In the "diagnosis" variable, some ids have only one diagnosis and some have multiple diagnoses. So I would also like to summarize "diagnosis" in a new variable, "wanted", to identify mutual exclusiveness.
The example data is:
d.f <- tribble (
~id, ~sex, ~diagnosis,
1, "M", "A",
1, NA, "B",
1, NA, "C",
2, NA, "A",
2, "F", NA,
2, NA, "A",
3, NA, NA,
3, "M", "A",
3, "M", "B",
4, "F", "C",
5, "F", "B",
6, NA, "A",
7, "M", NA
)
The desired data is:
wanted <- tribble (
~id, ~sex, ~diagnosis,~wanted,
1, "M", "A", "ABC group",
1, "M", "B", "ABC group",
1, "M", "C", "ABC group",
2, "F", "A", "Only A",
2, "F", NA, "Only A",
2, "F", "A", "Only A",
3, "M", NA, "AB group",
3, "M", "A", "AB group",
3, "M", "B", "AB group",
4, "F", "C", "Only C",
5, "F", "B", "Only B",
6, NA, "A", "Only A",
7, "M", NA, "Missing"
)
Fill the sex column using first(na.omit(sex)); first() is just an aggregating function, which is safe to use here.
The other column, wanted, can be created in two steps:
First, paste all the unique non-missing diagnoses in the group together using paste(unique(na.omit(diagnosis)), collapse = '').
Then use case_when to turn that string into a label of your choice.
library(tidyverse)
d.f %>%
group_by(id) %>%
mutate(sex = first(na.omit(sex)),
wanted = { x <- paste(unique(na.omit(diagnosis)), collapse = '');
case_when(nchar(x) == 1 ~ paste0('Only ', x),
nchar(x) == 0 ~ 'Missing',
TRUE ~ paste0(x, ' Group'))})
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC Group
#> 2 1 M B ABC Group
#> 3 1 M C ABC Group
#> 4 2 F A Only A
#> 5 2 F <NA> Only A
#> 6 2 F A Only A
#> 7 3 M <NA> AB Group
#> 8 3 M A AB Group
#> 9 3 M B AB Group
#> 10 4 F C Only C
#> 11 5 F B Only B
#> 12 6 <NA> A Only A
#> 13 7 M <NA> Missing
library(dplyr)
library(tidyr)
library(stringr)
d.f %>%
group_by(id) %>%
drop_na(diagnosis) %>%
# one string per id containing its unique diagnoses
summarise(wanted = str_c(c(unique(diagnosis)), collapse = "")) %>%
full_join(d.f, . , by = "id") %>%
group_by(id) %>%
fill(sex, .direction = "updown")
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC
#> 2 1 M B ABC
#> 3 1 M C ABC
#> 4 2 F A A
#> 5 2 F <NA> A
#> 6 2 F A A
#> 7 3 M <NA> AB
#> 8 3 M A AB
#> 9 3 M B AB
#> 10 4 F C C
#> 11 5 F B B
#> 12 6 <NA> A A
#> 13 7 M <NA> <NA>
This can also be used:
library(dplyr)
d.f %>%
group_by(id) %>%
mutate(sex = coalesce(sex, sex[!is.na(sex)][1]),
wanted = across(diagnosis, ~ {x <- unique(diagnosis[!is.na(diagnosis)])
if_else(length(x) > 1, paste(paste(x, collapse = ""), "Group"),
if_else(length(x) == 1, paste("Only", x[1]), "Missing")
)}))
# A tibble: 13 x 4
# Groups: id [7]
id sex diagnosis wanted$diagnosis
<dbl> <chr> <chr> <chr>
1 1 M A ABC Group
2 1 M B ABC Group
3 1 M C ABC Group
4 2 F A Only A
5 2 F NA Only A
6 2 F A Only A
7 3 M NA AB Group
8 3 M A AB Group
9 3 M B AB Group
10 4 F C Only C
11 5 F B Only B
12 6 NA A Only A
13 7 M NA Missing
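Because across() returns a data frame, the last approach produces a nested column (wanted$diagnosis). A plain helper function avoids that. The following is a sketch only: label_diagnoses is a hypothetical helper name, and the lowercase "group" wording is chosen to match the desired output in the question.

```r
library(dplyr)
library(tibble)

d.f <- tribble(
  ~id, ~sex, ~diagnosis,
  1, "M", "A",
  1, NA, "B",
  1, NA, "C",
  2, NA, "A",
  2, "F", NA,
  2, NA, "A",
  3, NA, NA,
  3, "M", "A",
  3, "M", "B",
  4, "F", "C",
  5, "F", "B",
  6, NA, "A",
  7, "M", NA
)

# hypothetical helper: one label per group of diagnoses
label_diagnoses <- function(diagnosis) {
  x <- unique(diagnosis[!is.na(diagnosis)])
  if (length(x) == 0) {
    "Missing"
  } else if (length(x) == 1) {
    paste("Only", x)
  } else {
    paste(paste(x, collapse = ""), "group")
  }
}

result <- d.f %>%
  group_by(id) %>%
  mutate(
    sex = sex[!is.na(sex)][1],           # first non-missing value, NA if there is none
    wanted = label_diagnoses(diagnosis)  # single label, recycled across the group
  ) %>%
  ungroup()

result
```

The helper returns a single string per id, so wanted is an ordinary character column.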

How to sort only one column without changing the others?

I have this example dataframe:
df <- tibble::tribble(
~id, ~A, ~B, ~C, ~D,
1L, "a", "d", "a", "a",
2L, "b", "c", "b", "b",
3L, "c", "b", "c", "c",
4L, "d", "a", "d", "d")
I want to sort only column B in ascending order without changing any other column.
Desired output:
id A B C D
<int> <chr> <chr> <chr> <chr>
1 1 a a a a
2 2 b b b b
3 3 c c c c
4 4 d d d d
I have tried arrange:
df %>%
arrange(B)
Here all other columns also change as expected.
Is there a way to only sort one column, although it might be against the logic of a dataframe?
An option could be:
df %>%
mutate(B = sort(B))
id A B C D
<int> <chr> <chr> <chr> <chr>
1 1 a a a a
2 2 b b b b
3 3 c c c c
4 4 d d d d
Base R:
df$B <- sort(df$B)
# A tibble: 4 x 5
id A B C D
<int> <chr> <chr> <chr> <chr>
1 1 a a a a
2 2 b b b b
3 3 c c c c
4 4 d d d d
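One caveat with both versions: sort() drops NA values by default, so the sorted vector would be shorter than the column if B contained missing values. A minimal sketch with made-up data (df_na is hypothetical) that keeps the length intact by using na.last = TRUE:

```r
library(dplyr)
library(tibble)

# illustrative data with a missing value in B
df_na <- tibble(
  id = 1:4,
  B  = c("d", NA, "b", "a")
)

sorted <- df_na %>%
  mutate(B = sort(B, na.last = TRUE))  # NAs are kept and placed at the end

sorted
```

With na.last = TRUE the column keeps its four elements, and the NA ends up in the last row.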
