How to remove missing values in summarise_all dplyr [duplicate] - r

I'm having trouble excluding missing values in the summarise_all function.
I have a dataset (df), shown below, and I'm running into two problems:
excluding missing values so that each output cell holds a single number
getting rid of the additional rows that repeat an ID but contain only 'TRUE' values (every second row in the df1 output below)
df1 dataset is the one I'm trying to get to.
Here's the whole enchilada:
df #the original dataset
ID type of data genes1 genes2 genes3 ...
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1
...
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code
#output
ID type of data genes1 genes2 genes3 ...
1 c("new","old","suggested") c(2,NA,NA) c(0,NA,NA) c(2,NA,NA)
1 TRUE TRUE TRUE TRUE
2 c("new","old","suggested") c(1,NA,NA) c(1,NA,NA) c(1,NA,NA)
2 TRUE TRUE TRUE TRUE
...
#my main concern is the "genes" type of data and the rows with same IDs and NA values, I wanted something like this
df1 #dream dataset
ID type of data genes1 genes2 genes3 ...
1 #doesn't matter 2 0 2
2 #doesn't matter 1 1 1
...
I also tried using na.omit in summarise_all but it didn't really fix anything.
Does anybody have any ideas on how to fix it?

You could do:
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(starts_with('genes'), ~.[!is.na(.)]))
#> # A tibble: 2 × 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
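A variant of the same idea, as a minimal sketch rather than a definitive fix: taking the first non-missing value with [1] always returns exactly one value per group, and gives NA when a group has no non-missing value at all. The helper name first_non_na is hypothetical, and the data frame is rebuilt from the question with the second column renamed to type.

```r
library(dplyr)

# Data rebuilt from the question (assumption: second column is named `type`)
df <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 2),
  type   = rep(c("new", "old", "suggested"), 2),
  genes1 = c(2, NA, NA, 1, NA, NA),
  genes2 = c(NA, 0, NA, NA, 1, NA),
  genes3 = c(NA, NA, 2, NA, NA, 1)
)

# Hypothetical helper: first non-NA value of a vector, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

df1 <- df %>%
  group_by(ID) %>%
  summarise(across(starts_with("genes"), first_non_na), .groups = "drop")

df1
```

Grouping by the bare column name (group_by(ID)) rather than df$ID also avoids creating an extra grouping column in the result.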

Another way
library(dplyr)
library(tidyr)  # fill() comes from tidyr
df[-2] |>
group_by(ID) |>
fill(genes1:genes3, .direction = "downup") |>
slice(1)
ID genes1 genes2 genes3
<int> <int> <int> <int>
1 1 2 0 2
2 2 1 1 1

An alternative approach based on the coalesce() function from dplyr
In the code below, we remove the type variable since the OP indicated we don't need it in the output. We then group_by() to essentially break up our data into a separate data.frame for each ID. Because across() calls it once per column, the coalesce_by_column() function we define receives one gene column of one group at a time and converts it into a list whose elements are that column's individual values.
We finally can pass this list to coalesce(). coalesce() takes a set of vectors and finds the first non-NA value across the vectors for each index of the vectors. In practice, this means it can take multiple columns with only one or zero non-NA value across all columns for each index and collapse them into a single column with as many non-NA values as possible.
Usually we would have to pass each vector as its own object to coalesce(), but we can use the splice operator !!! (https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r) to pass each element of our list as its own vector. See the last example in ?"!!!" for a demonstration.
library(dplyr)
library(tidyr)
# Define a function to coalesce by column
coalesce_by_column <- function(df) {
coalesce(!!! as.list(df))
}
# Remove NA rows
df %>%
select(-type) %>%
group_by(ID) %>%
summarise(across(everything(), coalesce_by_column))
#> # A tibble: 2 x 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
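To make the behaviour of coalesce() and the splice operator described above concrete, here is a small standalone sketch. The vectors v1, v2 and v3 are made up for illustration and do not come from the question.

```r
library(dplyr)

# Three illustrative vectors of equal length
v1 <- c(1, NA, NA)
v2 <- c(NA, 2, NA)
v3 <- c(9, 9, 3)

# For each position, coalesce() keeps the first non-NA value
direct <- coalesce(v1, v2, v3)

# The same call with the vectors held in a list and spliced in via !!!
spliced <- coalesce(!!!list(v1, v2, v3))

direct
spliced
```

Both calls return the same vector; splicing only matters when the vectors already sit in a list, as they do inside coalesce_by_column().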

If you are not worried about the type column you can do something like this
library(tidyverse)
" ID type genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1" %>%
read_table() -> df
df %>%
pivot_longer(-c(ID, type)) %>%
drop_na(value) %>%
select(-type) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
ID genes1 genes2 genes3
<dbl> <dbl> <dbl> <dbl>
1 1 2 0 2
2 2 1 1 1

If you want to keep the "type of data" column while using summarise, you can use the following code:
df <- read.table(text = "ID type_of_data genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1", header = TRUE)
library(dplyr)
library(tidyr)
df1 <- df %>%
group_by(ID) %>%
summarise(across(starts_with("genes"), na.omit),
type_of_data = type_of_data[genes1]) %>%
ungroup()
df1
#> # A tibble: 2 × 5
#> ID genes1 genes2 genes3 type_of_data
#> <int> <int> <int> <int> <chr>
#> 1 1 2 0 2 old
#> 2 2 1 1 1 new
Created on 2022-07-26 by the reprex package (v2.0.1)

Related

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If a group has several non-NA values in a column, max may not be what you want; you can use sum instead.
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, please check the code below:
library(dplyr)
library(tidyr)  # fill() comes from tidyr
data.frame(id,A,B,C) %>% group_by(id) %>% fill(c(A,B,C), .direction = 'downup') %>%
slice_head(n=1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2

Including missing values in summarise output

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9))
I then need to summarize by all ids, where id is a row and there is a column for the first seq_num and second one. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the answers in this answer, but they are not working.
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
outputs
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.
How can I do this?
If there is at most one observation per 'seq_num' for each 'id', indexing with [1] turns a missing case into NA:
library(dplyr)
dat %>%
group_by(id) %>%
summarise(seq_0_time = time[seq_num ==0][1],
seq_1_time = time[seq_num == 1][1], .groups = 'drop')
-output
# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA
This works because indexing a length-0 vector with [1] returns a length-1 NA. Similarly, indexing with positions that don't exist (1:2, 1:3, etc.) returns that many NAs:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA
This can actually be pretty easily solved using reshape.
> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
id time.0 time.1 time.2
1 1 4 5 NA
3 2 6 7 8
6 3 9 NA NA
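A tidyr counterpart of the reshape() call, sketched under the assumption that the column names seq_0_time, seq_1_time, ... from the question are wanted; the names_glue template is my choice and not part of the original answers.

```r
library(tidyr)

dat <- data.frame(id      = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time    = c(4, 5, 6, 7, 8, 9))

# One row per id, one column per seq_num; combinations that are absent become NA
wide <- pivot_wider(dat,
                    names_from  = seq_num,
                    values_from = time,
                    names_glue  = "seq_{seq_num}_time")

wide
```

Unlike the summarise() version, this also produces a seq_2_time column; drop it with dplyr::select() if only the first two are needed.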
My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9)) %>%
complete(seq_num = seq_num,
id = id)
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#> id seq_0_time seq_1_time
#> <dbl> <dbl> <dbl>
#> 1 1 4 5
#> 2 2 6 7
#> 3 3 9 NA
Created on 2022-04-20 by the reprex package (v2.0.1)

Using the dplyr library in R to "print" the name of the non-NA columns

Here is my data frame:
a <- data.frame(id=c(rep("A",2),rep("B",2)),
x=c(rep(2,2),rep(3,2)),
p.ABC= c(1,NA,1,1),
p.DEF= c(NA,1,NA,NA),
p.TAR= c(1,NA,1,1),
p.REP= c(NA,1,1,NA),
p.FAR= c(NA,NA,1,1))
I want to create a new character column (using mutate() in the dplyr library in R), which tells (by row) the name of the columns that have a non-NA value (here the non-NA value is always 1). However, it should only search among the columns that start with "p." and it should order the names by alphabetical order and then concatenate them using the expression "_" as a separator. You can find below the desired result, under the column called "name":
data.frame(id=c(rep("A",2),rep("B",2)),
x=c(rep(2,2),rep(3,2)),
p.ABC= c(1,NA,1,1),
p.DEF= c(NA,1,NA,NA),
p.TAR= c(1,NA,1,1),
p.REP= c(NA,1,1,NA),
p.FAR= c(NA,NA,1,1),
name=c("ABC_TAR","DEF_REP","ABC_FAR_REP_TAR","ABC_FAR_TAR"))
I would like to emphasize that I'm really looking for a solution using dplyr, as I would be able to do it without it (but it doesn't look pretty and it's slow).
Here is an option with tidyverse: we reshape the data into 'long' format with pivot_longer, group by the row number, paste together the values of the 'name' column (the original column names) after removing the prefix, and then bind that column to the original data
library(dplyr)
library(stringr)
library(tidyr)
a %>%
mutate(rn = row_number()) %>%
select(-id, -x) %>%
pivot_longer(cols = -rn, values_drop_na = TRUE) %>%
group_by(rn) %>%
summarise(name = str_c(str_remove(name, ".*\\."), collapse="_"),
.groups = 'drop') %>%
select(-rn) %>%
bind_cols(a, .)
-output
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
Or use pmap
library(purrr)
a %>%
mutate(name = pmap_chr(select(cur_data(), contains('.')), ~ {
nm1 <- c(...)
str_c(str_remove(names(nm1)[!is.na(nm1)], '.*\\.'), collapse="_")}))
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
Or use apply in base R
apply(a[-(1:2)], 1, function(x) paste(sub(".*\\.", "",
names(x)[!is.na(x)]), collapse="_"))
#[1] "ABC_TAR" "DEF_REP" "ABC_TAR_REP_FAR" "ABC_TAR_FAR"
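The outputs above list the names in column order (for example ABC_TAR_REP_FAR), whereas the question asks for alphabetical order. A minimal sketch of one way to add the sorting step; the helper name sorted_names is hypothetical and this is not the original answerer's code.

```r
library(dplyr)

a <- data.frame(id = c(rep("A", 2), rep("B", 2)),
                x = c(rep(2, 2), rep(3, 2)),
                p.ABC = c(1, NA, 1, 1),
                p.DEF = c(NA, 1, NA, NA),
                p.TAR = c(1, NA, 1, 1),
                p.REP = c(NA, 1, 1, NA),
                p.FAR = c(NA, NA, 1, 1))

# Hypothetical helper: names of the non-NA entries of one row,
# with the "p." prefix stripped, sorted, and joined with "_"
sorted_names <- function(row) {
  present <- names(row)[!is.na(row)]
  paste(sort(sub("^p\\.", "", present)), collapse = "_")
}

result <- a %>%
  mutate(name = apply(select(a, starts_with("p.")), 1, sorted_names))

result$name
```

The only change relative to the apply() answer is the sort() call before pasting.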
I think my answer may be similar to the others, but the syntax is written in tidyverse pipe style, so it may be easier to understand. If anyone feels it is a copy of theirs, I will be happy to delete it.
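# needs dplyr, purrr and stringr; pmap_chr() returns a character column instead of a list
a %>% mutate(name = pmap_chr(select(cur_data(), starts_with('p.')),
~ names(c(...))[!is.na(c(...))] %>%
str_remove_all(., fixed("p.")) %>%
paste(., collapse = '_')
)
)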
id x p.ABC p.DEF p.TAR p.REP p.FAR name
1 A 2 1 NA 1 NA NA ABC_TAR
2 A 2 NA 1 NA 1 NA DEF_REP
3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
The idea is that we can use pipes inside the map/reduce family of functions, which avoids having to write a custom function beforehand or create intermediate objects inside {}
Using rowwise :
library(dplyr)
cols <- grep('^p\\.', names(a), value = TRUE)
a %>%
rowwise() %>%
mutate(name = paste0(sub('p\\.', '',
cols[!is.na(c_across(starts_with('p')))]), collapse = '_')) %>%
ungroup
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
Updated
Special thanks to dear @akrun for helping me improve my code:
We just made a subtle modification to suppress a message produced by unnest_wider.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
a %>%
mutate(name = pmap(select(a, starts_with("p.")), ~ {nm1 <- names(c(...))[!is.na(c(...))];
setNames(nm1, seq_along(nm1))})) %>%
unnest_wider(name) %>%
rowwise() %>%
mutate(across(8:11, ~ str_remove(., fixed("p.")))) %>%
unite(NAME, c(8:11), sep = "_", na.rm = TRUE)
# A tibble: 4 x 8
id x p.ABC p.DEF p.TAR p.REP p.FAR NAME
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 A 2 1 NA 1 NA NA ABC_TAR
2 A 2 NA 1 NA 1 NA DEF_REP
3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
4 B 3 1 NA 1 NA 1 ABC_TAR_FAR

how to set names in a dynamically long list

Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable number of columns. E.g. suppose I want to store for each row the column index of the column(s) that contain the minimum per row. E.g. in row 1, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)
test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
ungroup() %>%
unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the names_repair argument within unnest_wider, i.e. unnest_wider(code, names_repair = ~paste0("code", .x)), but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution. But it is really ugly and requires stopping halfway through. The problem here is that the rename_with function doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
length_code = length(code)) %>%
ungroup() %>%
unnest_wider(code)
test3 <- test2 %>%
rename_with(.cols = starts_with("..."), .fn = ~paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
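One possible way to get the code_1, code_2 names in a single pipe is the names_sep argument of unnest_wider(). This is only a sketch, assuming tidyr 1.2.0 or later, where unnamed values are numbered automatically once names_sep is supplied; it also spells out c(x, y) instead of c_across() to keep the example small.

```r
library(dplyr)
library(tidyr)

test <- data.frame(x = c(NA, 1, 1, 2, 3, 4),
                   y = c(NA, 1, 2, 3, 4, 4))

result <- test %>%
  rowwise() %>%
  mutate(dist = min(c(x, y)),
         # positions of the columns that hold the row minimum
         code = list(which(c(x, y) == dist))) %>%
  ungroup() %>%
  # assumption: tidyr >= 1.2.0 numbers the unnamed elements as code_1, code_2, ...
  unnest_wider(code, names_sep = "_")

names(result)
```

With this there is no need for the intermediate length_code column or for rename_with().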

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as.tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 6 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
library(magrittr)  # provides %>%, which plyr does not export
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment
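For comparison, a dplyr-only sketch of the same idea. The intermediate name extra is made up for illustration and the approach is an assumption on my part, not the answerer's code: find each distinct (id, item_code) pair, attach the matching additional row once, then stack and sort.

```r
library(dplyr)

df <- tibble(id        = rep(1:3, each = 2),
             item_code = c("A", "A", "B", "B", "B", "Z"),
             score     = rep(1, 6))
additional_rows <- tibble(item_code = c("A", "Z"),
                          score     = c(6, 6))

# One additional row per (id, item_code) pair that has a match
extra <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# Stack with the original rows; the higher score sorts to the top of each block
result <- bind_rows(extra, df) %>%
  arrange(id, item_code, desc(score))

result
```

distinct() is what keeps an additional row from being added more than once per group.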
