Aggregate rows with a condition in R

My example
df <- data.frame(id1 = c("a" , "b", "c"),
id2 = c("a", "a", "d"),
n1 = c(2,2,0),
n2 = c(2,1,1),
n3 = c(0,1,1),
n4 = c(0,1,1))
First, I aggregated all rows across the numeric columns like this:
df <- df %>%
group_by(id2) %>%
summarise(across(c(n1,n2,n3,n4), sum, na.rm = TRUE),
.groups = "drop")
Now I would like to aggregate only the first two rows, which have a in column id2. How can I keep the column id1, since my desired output looks like the table below? Honestly, that column is only used for comparison with id2 and is largely redundant, but I really want to keep it.
id1 id2 n1 n2 n3 n4
a a 4 3 1 1
c d 0 1 1 1
Any suggestions for this?

Change the id1 values to 'a' in the rows where id2 is 'a', then group by both columns.
library(dplyr)
df %>%
group_by(id1 = ifelse(id2 == 'a', id2, id1), id2) %>%
summarise(across(starts_with('n'), sum, na.rm = TRUE), .groups = "drop")
# id1 id2 n1 n2 n3 n4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a a 4 3 1 1
#2 c d 0 1 1 1
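A self-contained sketch of the same idea, assuming dplyr 1.0 or later is installed. The name `result` is only illustrative, and writing the summary as an anonymous function inside across() is a stylistic choice for newer dplyr versions rather than part of the original answer.

```r
library(dplyr)

df <- data.frame(id1 = c("a", "b", "c"),
                 id2 = c("a", "a", "d"),
                 n1 = c(2, 2, 0),
                 n2 = c(2, 1, 1),
                 n3 = c(0, 1, 1),
                 n4 = c(0, 1, 1),
                 stringsAsFactors = FALSE)

# Rows whose id2 is "a" get id1 overwritten with "a", so they fall into one group;
# every other row keeps its own id1 and stays a group of its own.
result <- df %>%
  group_by(id1 = ifelse(id2 == "a", id2, id1), id2) %>%
  summarise(across(starts_with("n"), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

result
```

Because the relabelling happens inside group_by(), the original df is left unchanged.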

Another solution is to use case_when. This function is more readable when you need to handle several conditions:
library(dplyr)
df %>%
mutate(id1 = case_when(
id2 == 'a' ~ id2,
TRUE ~ id1
)) %>%
group_by(id1, id2) %>%
summarise(across(starts_with('n'), sum, na.rm = TRUE),
.groups = "drop")
which yields:
# A tibble: 2 x 6
# id1 id2 n1 n2 n3 n4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a a 4 3 1 1
#2 c d 0 1 1 1
Note: The summarise part was copied from @Ronak Shah's answer.

Related

How to collapse redundant rows together to get rid of mirrored NAs in two columns?

I'm modifying this toy df from this question, which is similar to mine but different enough that its answer has left me slightly confused.
df <- data.frame(id1 = c("a", NA, NA, "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "e", "e"),
n1 = c(2,2,3,3),
n2 = c(2,2,1,1),
n3 = c(0,0,3,3),
n4 = c(0,0,2,2))
This produces a dataframe looking like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
NA a a 2 2 0 0
NA a e 3 1 3 2
c NA e 3 1 3 2
Aside from id1 and id2, the first two rows and the last two rows are identical. I'm trying to fill in the blanks to make them completely identical, so I can then apply distinct() so that the now-duplicated rows disappear, resulting in a dataframe like this:
id1 id2 id3 n1 n2 n3 n4
a a a 2 2 0 0
c a e 3 1 3 2
Is there any way to accomplish this (preferably a tidyverse solution)? I'm basically trying to collapse all my data's redundancies.
Perhaps something like this?
df %>%
group_by(id3, n1, n2, n3, n4) %>%
summarise(id1 = na.omit(id1),
id2 = na.omit(id2)) %>%
ungroup() %>%
select(id1,id2,id3,n1,n2,n3,n4)
output
# A tibble: 2 × 7
id1 id2 id3 n1 n2 n3 n4
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a a 2 2 0 0
2 c a e 3 1 3 2
This solution is very specific to this scenario. It would not work if you had multiple id1s per group for example.
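One way to make this less fragile is sketched below. It assumes the missing ids are stored as real NA values (not the string "NA") and uses a hypothetical helper, `first_non_na`, that is not part of dplyr. When a group holds several different ids it keeps only the first one, which may or may not be what you want.

```r
library(dplyr)

df <- data.frame(id1 = c("a", NA, NA, "c"),
                 id2 = c(NA, "a", "a", NA),
                 id3 = c("a", "a", "e", "e"),
                 n1 = c(2, 2, 3, 3), n2 = c(2, 2, 1, 1),
                 n3 = c(0, 0, 3, 3), n4 = c(0, 0, 2, 2),
                 stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing value, or a typed NA if all are missing.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[[1]] else x[NA_integer_]
}

collapsed <- df %>%
  group_by(id3, n1, n2, n3, n4) %>%
  summarise(across(c(id1, id2), first_non_na), .groups = "drop") %>%
  select(id1, id2, id3, n1:n4)

collapsed
```

Since the helper always returns exactly one value, each group yields exactly one row regardless of how many ids it contains.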
Another possible solution where I first created an index to group on:
df <- data.frame(id1 = c("a", NA, NA, "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "e", "e"),
n1 = c(2,2,3,3),
n2 = c(2,2,1,1),
n3 = c(0,0,3,3),
n4 = c(0,0,2,2))
library(dplyr)
df %>%
mutate(index = rep(seq_len(2), each=2)) %>%
group_by(index) %>%
arrange(id1) %>%
summarise(across(everything(), ~ first(.x[!is.na(.x)]))) %>% # funs() is deprecated; use a lambda
select(-index)
#> # A tibble: 2 × 7
#> id1 id2 id3 n1 n2 n3 n4
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a a a 2 2 0 0
#> 2 c a e 3 1 3 2
Created on 2022-07-09 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
df %>%
group_by(id3, across(n1:n4)) %>%
fill(id1:id2, .direction = "updown") %>%
ungroup %>%
distinct
#> # A tibble: 2 × 7
#> id1 id2 id3 n1 n2 n3 n4
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a a a 2 2 0 0
#> 2 c a e 3 1 3 2

Collapse rows in R

I have a dataframe
df <- data.frame(id1 = c("a" , "b", "b", "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "a", "e"),
n1 = c(2,2,2,3),
n2 = c(2,1,1,1),
n3 = c(0,1,1,3),
n4 = c(0,1,1,2))
I want to collapse the 2nd and 3rd rows into one. Afterwards, I will aggregate by column id3 for rows sharing the same value (i.e. a).
My real dataframe is long and contains many different Latin names, so filtering by a name such as a doesn't make sense in this case. I am thinking of collapsing rows on the condition id3 == id2, but I could not get it to work. Any suggestions?
My desired output looks like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
b a a 2 1 1 1
c NA e 3 1 3 2
# After that, it should be
id1 id3 n1 n2 n3 n4
a a 4 3 1 1
c e 3 1 3 2
(I just updated the dataframe, sorry for my mistake)
We get the distinct rows to generate the first expected output:
library(dplyr)
df %>%
distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2
The final output follows from the above: after the distinct step, group by the coalesced 'id2' and 'id1' (taking 'id2' where it is not NA, otherwise 'id1') together with 'id3', and then sum the numeric columns:
df %>%
distinct %>%
group_by(id1 = coalesce(id2, id1), id3) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')
-output
# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
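To see why grouping on the coalesced column merges the first two rows, here is a small sketch of coalesce() on its own, using the ids left after the distinct step (the variable name `key` is just illustrative):

```r
library(dplyr)

id1 <- c("a", "b", "c")
id2 <- c(NA, "a", NA)

# coalesce() takes, position by position, the first value that is not NA:
# id2 where it is present, otherwise id1.
key <- coalesce(id2, id1)
key
# "a" "a" "c"
```

The row with id1 = "b" ends up with the key "a", so it is summed together with the first row.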
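Here is a slightly different way using slice after group_by instead of distinct. A grouping column cannot be modified with mutate, so we ungroup and regroup on the coalesced id before summing:
df %>%
  group_by(id1, id3) %>%
  dplyr::slice(1L) %>%   # keep the first row of each id1/id3 pair
  ungroup() %>%
  group_by(id1 = coalesce(id2, id1), id3) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")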
output:
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2

Add zero value to one row in R using tidyverse (wide format table)

I want to set the values of one row to 0 in a wide-format table, based on a condition on two columns, using the tidyverse.
My example is:
df <- data.frame(id1 = c("a" , "b", "c"),
id2 = c("a", "a", "d"),
n1 = c(2,2,0),
n2 = c(2,1,1),
n3 = c(0,1,1),
n4 = c(0,1,1))
I would like to say that if id1 == b and id2 == a, every value in the second row of the table is set to 0, except for the column n3.
My desired output:
id1 id2 n1 n2 n3 n4
a a 2 2 0 0
b a 0 0 1 0
c d 0 1 1 1
I can do with base R, but I am trying to do with the package tidyverse.
Any suggestion for this?
Using pivot_longer and then pivot_wider:
df %>%
pivot_longer(cols = 3:6) %>%
mutate(value = ifelse(id1 == "b" &
id2 == "a" & name != "n3", 0, value)) %>%
pivot_wider(names_from = name, values_from = value)
Output:
# A tibble: 3 x 6
id1 id2 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 2 2 0 0
2 b a 0 0 1 0
3 c d 0 1 1 1
Edit: To keep the data "wide":
df %>%
mutate(across(-c(id1,id2,n3),
~ifelse(id1 == "b" & id2 == "a",
0,
.)
)
)
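A self-contained sketch of the wide approach on the question's data, assuming dplyr 1.0 or later. It uses if_else instead of ifelse (a swapped-in, type-strict alternative) and names the columns to change explicitly. The name `zeroed` is illustrative.

```r
library(dplyr)

df <- data.frame(id1 = c("a", "b", "c"),
                 id2 = c("a", "a", "d"),
                 n1 = c(2, 2, 0),
                 n2 = c(2, 1, 1),
                 n3 = c(0, 1, 1),
                 n4 = c(0, 1, 1),
                 stringsAsFactors = FALSE)

# For every column except n3, replace the value with 0 in rows where
# id1 is "b" and id2 is "a"; all other rows are left untouched.
zeroed <- df %>%
  mutate(across(c(n1, n2, n4),
                ~ if_else(id1 == "b" & id2 == "a", 0, .x)))

zeroed
```

Listing the columns positively (c(n1, n2, n4)) is equivalent here to excluding id1, id2 and n3. Which reads better depends on how many columns the real table has.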

a beautiful solution to decode a table with dplyr and mutate

Dear dplyr/tidyverse companions, I am looking for a nice solution to the following problem. So far I can only solve it in base R with a loop. How do you solve this cleanly in the tidyverse?
I have a dataset called data, whose column names and (integer) values are not meaningful.
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additionally, I have a coding table, which gives every column a better name (var1 -> variable1) and every code a better value (1 -> "a"):
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result in which the columns are renamed and the codes are replaced by the real values, stored as factors. Compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
A general solution for any number of columns:
1. create a row number column to identify each row
2. get the data in long format
3. join it with coding for each value
4. keep only unique rows and get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = 'code') %>%
select(row, name = name.y, value) %>%
distinct() %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
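Both answers above return character columns, while the question asks for factors. The sketch below is one possible way to get factors directly, assuming dplyr 1.0 or later. `decode_column` and `decoded` are hypothetical names introduced here, and the factor levels are assumed to follow the order of the rows in coding.

```r
library(dplyr)

data <- tibble(var1 = rep(1:3, 2),
               var2 = rep(1:3, 2))
coding <- tibble(variable = c(rep("var1", 3), rep("var2", 3)),
                 name = c(rep("variable1", 3), rep("variable2", 3)),
                 code = rep(1:3, 2),
                 value = rep(c("a", "b", "c"), 2))

# Hypothetical helper: look up the codes of one column in the coding table
# and return them as a factor whose levels follow the coding order.
decode_column <- function(x, column) {
  map <- coding[coding$variable == column, ]
  factor(map$value[match(x, map$code)], levels = map$value)
}

decoded <- data %>%
  mutate(across(everything(), ~ decode_column(.x, cur_column()))) %>%
  rename_with(~ coding$name[match(.x, coding$variable)])

decoded
```

Unlike a join on code alone, the lookup here is restricted to the rows of coding that belong to the current column, so it should still behave if two variables use the same code for different values.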

Using `rle` function along with `dplyr` `group_by` command to mapping grouping variable

I have a dataframe with three columns, similar to the data frame given below. I want to extract a pattern from the information in column a.
With help from a few developers (@thelatemail and @David T), I was able to identify the pattern with the rle function (see the earlier question "using rle function to identify pattern"). Now I want to go further and add grouping information to the extracted pattern. I tried the dplyr do function (see the code below), but it does not work.
The example data and the desired output are given below for reference.
##mycode that produces error - needs to be fixed
test <- data%>%
group_by(b, c)%>%
do(., data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)
## desired output
c b from to fromCount toCount
<chr> <chr> <chr> <chr> <int> <int>
1 A01 experimental a b 1 3
2 A02 experimental a c 1 1
3 A02 experimental c a 1 1
4 A02 experimental a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2
Compared to the earlier post here, the information gets compressed since we apply grouping to the a column.
We could use rleid from data.table
library(data.table)
library(dplyr)
data %>%
group_by(b, c, grp = rleid(a)) %>%
summarise(from = first(a), fromCount = n()) %>%
mutate(to = lead(from), toCount = lead(fromCount)) %>%
ungroup %>%
select(-grp) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
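For readers who have not used rleid before, here is a small sketch of what it returns, assuming the data.table package is installed. The input is the first eight values of column a; `run_id` is an illustrative name.

```r
library(data.table)

a <- c("a", "b", "b", "b", "a", "c", "a", "b")

# rleid() gives every run of consecutive equal values its own integer id;
# the id increases by one each time the value changes.
run_id <- rleid(a)
run_id
# 1 2 2 2 3 4 5 6
```

Grouping by this id therefore produces one group per run, which is why first(a) and n() give the run's value and length.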
Or use rle directly: after grouping by 'b' and 'c', summarise with rle to create a list column, extract the 'values' and 'lengths' from that column within the same summarise, create 'to' and 'toCount' as the lead of 'from' and 'fromCount', filter out the rows where 'to' is NA, and arrange the rows by the 'c' column:
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a)),
from = rl[[1]]$values,
fromCount = rl[[1]]$lengths) %>%
mutate(to = lead(from),
toCount = lead(fromCount)) %>%
ungroup %>%
select(-rl) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
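The from/to logic can also be sketched in base R for a single ungrouped vector, which may help when checking the grouped results by hand. The vector x and the name `transitions` are made up for illustration.

```r
x <- c("a", "b", "b", "b", "a", "c")

# rle() returns the run values and how long each run is.
runs <- rle(x)
runs$values   # "a" "b" "a" "c"
runs$lengths  # 1 3 1 1

# Pair each run with the run that follows it; the last run has no
# successor, so it only appears in the "to" columns.
transitions <- data.frame(from = head(runs$values, -1),
                          to = tail(runs$values, -1),
                          fromCount = head(runs$lengths, -1),
                          toCount = tail(runs$lengths, -1),
                          stringsAsFactors = FALSE)
transitions
```

Using head(..., -1) and tail(..., -1) plays the same role as lead() followed by filter(!is.na(to)) in the dplyr versions.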
We could also loop over the rle list column ('rl') with map, build a tibble from the values and lengths together with their lead, use unnest_wider to turn the tibble columns into list columns, unnest those, filter out the NA elements, and arrange:
library(tidyr)
library(purrr)
data %>%
group_by(b, c) %>%
summarise(rl = list(rle(a))) %>%
ungroup %>%
mutate(out = map(rl,
~ tibble(from = .x$values,
fromCount = .x$lengths,
to = lead(from),
toCount = lead(fromCount)))) %>%
unnest_wider(c(out)) %>%
unnest(from:toCount) %>%
filter(!is.na(to)) %>%
arrange(c) %>%
select(-rl)
Or, in the tidyverse, create a function that computes the rle of the Tracking values for a single subject:
rleSlice <- function(Tracking) {
rlTrack <- rle(as.character(Tracking)) # Strip the levels from the factor, they interfere
tibble(from = rlTrack$values, to = lead(rlTrack$values),
fromCount = rlTrack$lengths, toCount = lead(rlTrack$lengths)) %>%
filter(!is.na(to)) %>%
list()
}
Make sure it's behaving:
rleSlice(c("a", "b", "b", "b", "c"))
[[1]]
# A tibble: 2 x 4
from to fromCount toCount
<chr> <chr> <int> <int>
1 a b 1 3
2 b c 3 1
Now we'll group and get the rle for each participant
data %>%
as_tibble() %>%
# This is easier to track than all these a,b,c's
rename(Subject = c, Test = b, Tracking = a) %>%
group_by(Subject, Test) %>%
summarise(Slice = rleSlice(Tracking)) %>%
unnest(col = "Slice") %>%
ungroup()
# A tibble: 6 x 6
Subject Test from to fromCount toCount
<fct> <fct> <chr> <chr> <int> <int>
1 A01 experiment a b 1 3
2 A02 experiment a c 1 1
3 A02 experiment c a 1 1
4 A02 experiment a b 1 1
5 A03 control d e 3 1
6 A04 control f e 2 2
