Collapse rows in R

I have a dataframe
df <- data.frame(id1 = c("a" , "b", "b", "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "a", "e"),
n1 = c(2,2,2,3),
n2 = c(2,1,1,1),
n3 = c(0,1,1,3),
n4 = c(0,1,1,2))
I want to collapse the 2nd and 3rd rows into one. Afterwards, I will aggregate by column id3 for rows sharing the same character (i.e. a).
My real dataframe is long and contains many different Latin names, so filtering by name (e.g. a) doesn't make sense in this case. I was thinking of collapsing rows with the condition id3 == id2, but I could not get it to work. Any suggestions?
My desired output looks like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
b a a 2 1 1 1
c NA e 3 1 3 2
After that, it should be:
id1 id3 n1 n2 n3 n4
a a 4 3 1 1
c e 3 1 3 2
(I just updated the dataframe, sorry for my mistake)

We get the distinct rows to generate the first expected output:
library(dplyr)
df %>%
distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2
The final output follows from the above: after the distinct step, group by the coalesced 'id2' and 'id1' together with 'id3', then take the sum of the numeric columns.
df %>%
distinct %>%
group_by(id1 = coalesce(id2, id1), id3) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')
-output
# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
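As a minimal sketch of why the grouping key above merges rows "a" and "b": coalesce() takes, element by element, the first non-missing value, so id2 wins wherever it is present and id1 is the fallback. The vectors below are copied from the question's df; the name key is only illustrative.

```r
library(dplyr)

id1 <- c("a", "b", "b", "c")
id2 <- c(NA, "a", "a", NA)

# For each position, take id2 if it is not NA, otherwise fall back to id1
key <- coalesce(id2, id1)
key
```

Rows whose key and id3 agree then end up in the same group when summed.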

Here is a slightly different way using slice after group_by instead of distinct:
df %>%
group_by(id1, id3) %>%
dplyr::slice(1L) %>%
mutate(id1 = coalesce(id2,id1)) %>%
summarise(across(where(is.numeric), sum))
output:
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2

Related

How to collapse redundant rows together to get rid of mirrored NAs in two columns?

I'm modifying this toy df from this question, which is similar to mine but different enough that its answer has left me slightly confused.
df <- data.frame(id1 = c("a" , NA, NA, "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "e", "e"),
n1 = c(2,2,3,3),
n2 = c(2,2,1,1),
n3 = c(0,0,3,3),
n4 = c(0,0,2,2))
This produces a dataframe looking like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
NA a a 2 2 0 0
NA a e 3 1 3 2
c NA e 3 1 3 2
Aside from id1 and id2, the first two rows and the last two rows are identical. I'm trying to fill in the blanks to make them completely identical, so I can then apply distinct() so that the now-duplicated rows disappear, resulting in a dataframe like this:
id1 id2 id3 n1 n2 n3 n4
a a a 2 2 0 0
c a e 3 1 3 2
Is there any way to accomplish this (preferably a tidyverse solution)? I'm basically trying to collapse all my data's redundancies.
Perhaps something like this?
df %>%
group_by(id3, n1, n2, n3, n4) %>%
summarise(id1 = na.omit(id1),
id2 = na.omit(id2)) %>%
ungroup() %>%
select(id1,id2,id3,n1,n2,n3,n4)
output
# A tibble: 2 × 7
id1 id2 id3 n1 n2 n3 n4
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 a a a 2 2 0 0
2 c a e 3 1 3 2
This solution is very specific to this scenario. It would not work if you had multiple id1s per group for example.
Another possible solution where I first created an index to group on:
df <- data.frame(id1 = c("a" , NA, NA, "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "e", "e"),
n1 = c(2,2,3,3),
n2 = c(2,2,1,1),
n3 = c(0,0,3,3),
n4 = c(0,0,2,2))
library(dplyr)
df %>%
mutate(index = rep(seq_len(2), each=2)) %>%
group_by(index) %>%
arrange(id1) %>%
summarise(across(everything(), ~ first(.x[!is.na(.x)]))) %>%
select(-index)
#> # A tibble: 2 × 7
#> id1 id2 id3 n1 n2 n3 n4
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a a a 2 2 0 0
#> 2 c a e 3 1 3 2
Created on 2022-07-09 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
df %>%
group_by(id3, across(n1:n4)) %>%
fill(id1:id2, .direction = "updown") %>%
ungroup %>%
distinct
#> # A tibble: 2 × 7
#> id1 id2 id3 n1 n2 n3 n4
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a a a 2 2 0 0
#> 2 c a e 3 1 3 2
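To illustrate what fill() with .direction = "updown" does inside each group, here is a small sketch on a toy data frame (the names toy, g, x and filled are made up for this example and are not part of the question's data). Within a group, missing values are first filled from the value below and then from the value above, which makes the rows of the group identical so that distinct can drop the duplicates.

```r
library(dplyr)
library(tidyr)

# Two groups, each with one known value and one NA
toy <- data.frame(g = c(1, 1, 2, 2), x = c("a", NA, NA, "c"))

filled <- toy %>%
  group_by(g) %>%
  fill(x, .direction = "updown") %>%  # fill upwards first, then downwards
  ungroup()

filled$x

# After filling, each group consists of identical rows
collapsed <- distinct(filled)
```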

R Calculate sum of values by unique column PAIRS (B-A and A-B) while keeping both pairs [duplicate]

I'm dealing with the following issue. I would like to sum the count by Date, and unique pair of ID1 and ID2, meaning that A-B and B-A are ONE pair. However, I want to keep both pairs and their sum in my dataset.
My Dataset looks like this:
Date ID1 ID2 Count
12-1 A B 1
12-1 B A 1
12-1 D E 1
12-1 E D 2
12-2 Y Z 2
12-2 Z Y 3
An expected output looks like this:
Date ID1 ID2 SUM
12-1 A B 2
12-1 B A 2
12-1 D E 3
12-1 E D 3
12-2 Y Z 5
12-2 Z Y 5
My Question can be seen as an extension of this previous question:
R sum observations by unique column PAIRS (B-A and A-B) and NOT unique combinations (B-A or A-B)
Many thanks in advance.
Here is a way.
First, create a vector of sorted values in the columns ID1 and ID2, and paste them together. Then sum Count within each Date and pair with ave. Finally, remove the helper column.
df1$unique <- apply(df1[c("ID1", "ID2")], 1, \(x) paste(sort(x), collapse = ""))
df1$Sum <- with(df1, ave(Count, Date, unique, FUN = sum))
df1$unique <- NULL
df1
# Date ID1 ID2 Count Sum
#1 12-1 A B 1 2
#2 12-1 B A 1 2
#3 12-1 D E 1 3
#4 12-1 E D 2 3
#5 12-2 Y Z 2 5
#6 12-2 Z Y 3 5
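As a small sketch of how ave behaves here: it applies FUN within each group and returns a vector of the same length as its input, so every row keeps its place and receives its group's total. ave also accepts more than one grouping vector, which matters if the same pair can occur on several dates. The vectors below are made-up toy values, not the question's data.

```r
count <- c(1, 1, 2, 3)
date  <- c("12-1", "12-1", "12-2", "12-2")
pair  <- c("AB", "AB", "AB", "AB")

# Grouping by pair only: all four rows share one total
by_pair <- ave(count, pair, FUN = sum)

# Grouping by date and pair: totals are computed per date
by_date_pair <- ave(count, date, pair, FUN = sum)

by_pair
by_date_pair
```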
This may also be done with pmin/pmax to create a grouping column
library(dplyr)
library(stringr)
df1 %>%
group_by(Date, grp = str_c(pmin(ID1, ID2), pmax(ID1, ID2))) %>%
mutate(Sum = sum(Count)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 6 × 5
Date ID1 ID2 Count Sum
<chr> <chr> <chr> <int> <int>
1 12-1 A B 1 2
2 12-1 B A 1 2
3 12-1 D E 1 3
4 12-1 E D 2 3
5 12-2 Y Z 2 5
6 12-2 Z Y 3 5
data
df1 <- structure(list(Date = c("12-1", "12-1", "12-1", "12-1", "12-2",
"12-2"), ID1 = c("A", "B", "D", "E", "Y", "Z"), ID2 = c("B",
"A", "E", "D", "Z", "Y"), Count = c(1L, 1L, 1L, 2L, 2L, 3L)),
class = "data.frame", row.names = c(NA,
-6L))
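To see why the pmin/pmax grouping column treats A-B and B-A as the same pair, here is a minimal sketch using the first four ID values from the data above (pair_key is an illustrative name). pmin and pmax compare the two vectors element by element, so the alphabetically smaller ID always comes first in the pasted key.

```r
id1 <- c("A", "B", "D", "E")
id2 <- c("B", "A", "E", "D")

# Element-wise minimum and maximum give an order-independent key
pair_key <- paste0(pmin(id1, id2), pmax(id1, id2))
pair_key
```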
Here is a dplyr solution making use of lapply.
In essence, we create a new column y that holds the two IDs in alphabetical order, so that we can also group by this column:
library(dplyr)
library(stringr)
df1 %>%
mutate(x = paste(ID1, ID2)) %>%
mutate(y = str_split(x, ' ') %>% lapply(., 'sort') %>% lapply(., 'paste', collapse=' ')) %>%
group_by(Date, y) %>%
mutate(SUM = sum(Count)) %>%
ungroup() %>%
select(-c(x, y, Count))
Date ID1 ID2 SUM
<chr> <chr> <chr> <int>
1 12-1 A B 2
2 12-1 B A 2
3 12-1 D E 3
4 12-1 E D 3
5 12-2 Y Z 5
6 12-2 Z Y 5

Adding values to one column based on conditions

I would like to update one column based on 2 columns
My example dataframe contains 3 columns
df <- data.frame(n1 = c(1,2,1,2,5,6),
n2 = c("a", "a", "a", NA, "b", "c"),
n3 = c("red", "red", NA, NA, NA, NA))
df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a <NA>
4 2 <NA> <NA>
5 5 b <NA>
6 6 c <NA>
I would like to add red to rows 3 and 4, on the condition that the values of n1 (i.e. 1, 2) are matched with n2 (i.e. a), even for the fourth row (where n1 has no matching n2).
The main point is: if n2 == a, then for the values of n1 associated with a, the n3 values in the same rows as those n1 values should be set to red.
My desired output
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
Any suggestions for this case? I hope my explanation is clear enough. Since my data is very long, I am trying to find a good way to handle it.
In base R, create a logical vector that selects the rows of 'df' whose 'n1' value is among the unique values of 'n1' where 'n2' is "a", then assign the first non-NA element of 'n3' from those rows to 'n3' in all of them.
i1 <- with(df, n1 %in% unique(n1[n2 %in% 'a']))
df$n3[i1] <- na.omit(df$n3[i1])[1]
-output
> df
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
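A step-by-step sketch of how the logical index i1 is built may help. It uses the n1 and n2 columns from the question; toy and a_values are illustrative names. Note that %in% returns FALSE rather than NA for the missing n2 in row 4, which is why that row can still be picked up through its n1 value.

```r
toy <- data.frame(n1 = c(1, 2, 1, 2, 5, 6),
                  n2 = c("a", "a", "a", NA, "b", "c"))

# Rows where n2 is "a" (NA counts as FALSE with %in%)
is_a <- toy$n2 %in% "a"

# The n1 values that occur together with "a"
a_values <- unique(toy$n1[is_a])

# Every row whose n1 is one of those values, including row 4
i1 <- toy$n1 %in% a_values
i1
```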
Update:
df %>%
mutate(group = rep(row_number(), each=2, length.out = n())) %>%
group_by(group) %>%
mutate(n3 = ifelse(n1 %in% c(1,2) & any(n2 %in% "a", na.rm = TRUE), "red", n3)) %>%
ungroup() %>%
select(-group)
We could use an ifelse statement with conditions defined using any.
library(dplyr)
df %>%
mutate(n3 = ifelse(n1 %in% c(1, 2) & any(n2[3:4] %in% "a"), "red", n3))
n1 n2 n3
1 1 a red
2 2 a red
3 1 a red
4 2 <NA> red
5 5 b <NA>
6 6 c <NA>
library(dplyr)
library(tidyr)
df %>%
group_by(n1) %>%
fill(n3) %>%
group_by(n2) %>%
fill(n3)
# # A tibble: 6 × 3
# # Groups: n2 [4]
# n1 n2 n3
# <dbl> <chr> <chr>
# 1 1 a red
# 2 2 a red
# 3 1 a red
# 4 2 NA red
# 5 5 b NA
# 6 6 c NA

aggregate rows with condition in R

My example
df <- data.frame(id1 = c("a" , "b", "c"),
id2 = c("a", "a", "d"),
n1 = c(2,2,0),
n2 = c(2,1,1),
n3 = c(0,1,1),
n4 = c(0,1,1))
First, I aggregated all rows across the columns like this:
df <- df %>%
group_by(id2) %>%
summarise(across(c(n1,n2,n3,n4), \(x) sum(x, na.rm = TRUE)),
.groups = "drop")
Now I would like to aggregate only the first two rows, which have a in column id2. How can I keep the column id1, since my desired output looks like this? Honestly, that column is only used for comparison with id2 and is quite redundant, but I really want to keep it.
id1 id2 n1 n2 n3 n4
a a 4 3 1 1
c d 0 1 1 1
Any suggestions for this?
Change the id1 values to the id2 value in rows where id2 is 'a'.
library(dplyr)
df %>%
group_by(id1 = ifelse(id2 == 'a', id2, id1), id2) %>%
summarise(across(starts_with('n'), \(x) sum(x, na.rm = TRUE)), .groups = "drop")
# id1 id2 n1 n2 n3 n4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a a 4 3 1 1
#2 c d 0 1 1 1
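The regrouping step can be checked on its own. This is a minimal sketch using the id1 and id2 vectors from the question (new_id1 is an illustrative name): wherever id2 is 'a', the id1 value is replaced by id2, so rows 1 and 2 share the same key and are summed together.

```r
id1 <- c("a", "b", "c")
id2 <- c("a", "a", "d")

# Replace id1 by id2 where id2 is "a"; keep id1 otherwise
new_id1 <- ifelse(id2 == "a", id2, id1)
new_id1
```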
Another solution is to use case_when, which is more readable when you need several conditional cases:
library(dplyr)
df %>%
mutate(id1 = case_when(
id2 == 'a' ~ id2,
TRUE ~ id1
)) %>%
group_by(id1, id2) %>%
summarise(across(starts_with('n'), \(x) sum(x, na.rm = TRUE)),
.groups = "drop")
which yields:
## A tibble: 2 x 6
# id1 id2 n1 n2 n3 n4
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a a 4 3 1 1
#2 c d 0 1 1 1
Note: The summarise part was copied from #Ronak Shah's answer

Sum values from DF and make a new one

I have this dataframe in R:
ID <- c(rep("ID1" , 4) , rep("ID2" , 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2,4,6,8,1,3,5,7)
data.frame(ID, mut, count)
ID mut count
1 ID1 AC 2
2 ID1 TG 4
3 ID1 AG 6
4 ID1 TC 8
5 ID2 AC 1
6 ID2 TG 3
7 ID2 AG 5
8 ID2 TC 7
I want to create a new one where I sum the values of count based on "mut" column.
Basically, for each ID, I would sum the count from mut=AC and TG and from AG and TC, to obtain this:
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
I have absolutely no clue on how to do this!!
Thanks!!
M
You had better make sure you have an even number of elements in each ID, since the code below pairs rows off two at a time.
df=data.frame(ID, mut, count)
df$sek=rep(1:(nrow(df)/2),each=2)
do.call(rbind,
by(df,list(df$sek),function(x){
data.frame(
"ID"=x$ID[1],
"new_mut"=paste0(x$mut,collapse="-"),
"count"=sum(x$count)
)
})
)
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
Using dplyr :
library(dplyr)
df %>%
group_by(ID, val = ceiling(match(mut, unique(mut))/2)) %>%
summarise(mut = paste0(mut,collapse="-"),
count = sum(count)) %>%
select(-val)
# ID mut count
# <chr> <chr> <dbl>
#1 ID1 AC-TG 6
#2 ID1 AG-TC 14
#3 ID2 AC-TG 4
#4 ID2 AG-TC 12
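The grouping expression in the dplyr answer can be unpacked as follows. This is a sketch using the mut vector from the question, with pos and val as illustrative names: match gives each mutation its position among the unique values, and halving that position and rounding up puts consecutive mutations into the same pair.

```r
mut <- rep(c("AC", "TG", "AG", "TC"), 2)

# Position of each mutation among the unique values, in order of appearance
pos <- match(mut, unique(mut))

# Positions 1 and 2 map to pair 1, positions 3 and 4 map to pair 2
val <- ceiling(pos / 2)

pos
val
```

This relies on the unique mutations appearing in the order in which they should be paired.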
