R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row per id that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().

You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
group_by(id) %>%
summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If a group has several non-NA values in the same column, max keeps only the largest one, which may not be what you want; you can use sum instead to add them up.
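When a group can hold more than one non-NA value per column, it may help to state explicitly which value should survive. The following is only a sketch: first_non_na() is a hypothetical helper (not part of the original answer), and the second id is made-up data added for illustration.

```r
library(dplyr)

# Data from the question, plus an invented second id with two non-NA values in C
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  A  = c(3, NA, NA, NA, 7),
  B  = c(NA, 5, NA, 4, NA),
  C  = c(NA, NA, 2, 1, 9)
)

# Hypothetical helper: first non-NA value of a vector.
# Indexing an empty vector with [1] yields NA, so all-NA columns stay NA.
first_non_na <- function(x) x[!is.na(x)][1]

res <- df %>%
  group_by(id) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

res
```

Unlike max() with na.rm = TRUE, this keeps the value that appears first in row order and does not warn or return -Inf when a column is entirely NA within a group.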

Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2

Alternatively, fill the NAs within each group in both directions and keep the first row:
library(dplyr)
library(tidyr)
data.frame(id,A,B,C) %>% group_by(id) %>% fill(c(A,B,C), .direction = 'downup') %>%
slice_head(n=1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2

Related

How to remove missing values in summarise_all dplyr [duplicate]

I'm having trouble excluding missing values in the summarise_all function.
I have a dataset (df) as shown below and basically I'm having two problems:
excluding missing values and the output only being one number
additional data rows with same IDs but NA values (the second column with 'TRUE' values in df1 dataset)
df1 dataset is the one I'm trying to get to.
Here's the whole enchilada:
df #the original dataset
ID type of data genes1 genes2 genes3 ...
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1
...
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code
#output
ID type of data genes1 genes2 genes3 ...
1 c("new","old","suggested") c(2,NA,NA) c(0,NA,NA) c(2,NA,NA)
1 TRUE TRUE TRUE TRUE
2 c("new","old","suggested") c(1,NA,NA) c(1,NA,NA) c(1,NA,NA)
2 TRUE TRUE TRUE TRUE
...
#my main concern is the "genes" type of data and the rows with same IDs and NA values, I wanted something like this
df1 #dream dataset
ID type of data genes1 genes2 genes3 ...
1 #doesn't matter 2 0 2
2 #doesn't matter 1 1 1
...
I also tried using na.omit in summarise_all but it didn't really fix anything.
Does anybody have any ideas on how to fix it?

You could do:
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(starts_with('genes'), ~.[!is.na(.)]))
#> # A tibble: 2 × 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1

Another way, using fill() from tidyr:
library(dplyr)
library(tidyr)
df[-2] |>
group_by(ID) |>
fill(genes1:genes3, .direction = "downup") |>
slice(1)
ID genes1 genes2 genes3
<int> <int> <int> <int>
1 1 2 0 2
2 2 1 1 1

An alternative approach based on the coalesce() function from dplyr
In the code below, we remove the type variable since the OP indicated it is not needed in the output. We then group_by() to essentially break up our data into separate data.frames for each ID. Within each group, across() passes one gene column at a time to the coalesce_by_column() function we define, which converts that column into a list with one element per value.
Finally, we pass this list to coalesce(). coalesce() takes a set of vectors and finds the first non-NA value across the vectors for each index of the vectors. In practice, this means it can take multiple columns with only one or zero non-NA values across all columns for each index and collapse them into a single column with as many non-NA values as possible. Here each spliced element has length one, so the result is the first non-NA value of the column.
Usually we would have to pass each vector as its own object to coalesce(), but we can use the splice operator !!! (see https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r) to pass each element of our list as its own vector. See the last example in ?"!!!" for a demonstration.
library(dplyr)
library(tidyr)
# Define a function to coalesce by column
coalesce_by_column <- function(df) {
coalesce(!!! as.list(df))
}
# Remove NA rows
df %>%
select(-type) %>%
group_by(ID) %>%
summarise(across(.fns = coalesce_by_column))
#> # A tibble: 2 x 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
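To see what coalesce() and !!! do in isolation, here is a minimal sketch with three made-up vectors (x, y and z are illustrative and not part of the question's data):

```r
library(dplyr)

# Three vectors of equal length with gaps in different positions
x <- c(1, NA, NA)
y <- c(NA, 2, NA)
z <- c(9, 9, 3)

# Position by position, coalesce() keeps the first non-NA value
direct <- coalesce(x, y, z)

# The same call with the vectors stored in a list and spliced in with !!!
vecs <- list(x, y, z)
spliced <- coalesce(!!!vecs)

direct
spliced
```

Both calls give the same vector. In the answer above the list comes from as.list() applied to a single column, so every spliced element has length one and coalesce() returns one value per group.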

If you are not worried about the type column you can do something like this
library(tidyverse)
" ID type genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1" %>%
read_table() -> df
df %>%
pivot_longer(-c(ID, type)) %>%
drop_na(value) %>%
select(-type) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
ID genes1 genes2 genes3
<dbl> <dbl> <dbl> <dbl>
1 1 2 0 2
2 2 1 1 1

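If you want to keep the "type of data" column while using summarise, you can use the following code. Note that type_of_data[genes1] uses the summarised genes1 value as a position index, so which type is kept depends on the data (here "old" for ID 1 and "new" for ID 2):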
df <- read.table(text = "ID type_of_data genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1", header = TRUE)
library(dplyr)
library(tidyr)
df1 <- df %>%
group_by(ID) %>%
summarise(across(starts_with("genes"), na.omit),
type_of_data = type_of_data[genes1]) %>%
ungroup()
df1
#> # A tibble: 2 × 5
#> ID genes1 genes2 genes3 type_of_data
#> <int> <int> <int> <int> <chr>
#> 1 1 2 0 2 old
#> 2 2 1 1 1 new
Created on 2022-07-26 by the reprex package (v2.0.1)

Including missing values in summarise output

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9))
I then need to summarize by id, with one row per id and one column each for the first and second seq_num. Even if the second one doesn't exist, I'd still like that row to be kept, with an NA in that slot. I've tried the approaches from a related answer, but they are not working.
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
outputs
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.
How can I do this?

If there is at most one observation per 'seq_num' for each 'id', indexing with [1] turns an empty result into NA:
library(dplyr)
dat %>%
group_by(id) %>%
summarise(seq_0_time = time[seq_num ==0][1],
seq_1_time = time[seq_num == 1][1], .groups = 'drop')
-output
# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA
The point is that a zero-length result becomes length 1 (a single NA) when indexed with [1]. In the same way, indexing with 1:2, 1:3, etc. pads the result with one NA for every position that does not exist:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA

This can actually be pretty easily solved using reshape.
> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
id time.0 time.1 time.2
1 1 4 5 NA
3 2 6 7 8
6 3 9 NA NA
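The same reshaping can be sketched with pivot_wider() from tidyr, which fills missing id/seq_num combinations with NA. This is an alternative to the base reshape() call above, not the answerer's code; the names_glue pattern is chosen here only to mimic the column names used in the question.

```r
library(tidyr)

dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9))

# One row per id, one column per seq_num; absent combinations become NA
wide <- pivot_wider(dat,
                    names_from = seq_num,
                    values_from = time,
                    names_glue = "seq_{seq_num}_time")

wide
```

If only the first two sequence numbers are needed, the extra seq_2_time column can simply be dropped afterwards.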

My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
seq_num=c(0:1,0:2,0:0),
time=c(4,5,6,7,8,9)) %>%
complete(seq_num = seq_num,
id = id)
dat %>%
group_by(id, .drop=FALSE) %>%
summarise(seq_0_time = time[seq_num==0],
seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#> id seq_0_time seq_1_time
#> <dbl> <dbl> <dbl>
#> 1 1 4 5
#> 2 2 6 7
#> 3 3 9 NA
Created on 2022-04-20 by the reprex package (v2.0.1)

Using the dplyr library in R to "print" the name of the non-NA columns

Here is my data frame:
a <- data.frame(id=c(rep("A",2),rep("B",2)),
x=c(rep(2,2),rep(3,2)),
p.ABC= c(1,NA,1,1),
p.DEF= c(NA,1,NA,NA),
p.TAR= c(1,NA,1,1),
p.REP= c(NA,1,1,NA),
p.FAR= c(NA,NA,1,1))
I want to create a new character column (using mutate() in the dplyr library in R) that gives, for each row, the names of the columns that have a non-NA value (here the non-NA value is always 1). However, it should only search among the columns that start with "p.", it should sort the names alphabetically, and it should concatenate them using "_" as a separator. You can find the desired result below, in the column called "name":
data.frame(id=c(rep("A",2),rep("B",2)),
x=c(rep(2,2),rep(3,2)),
p.ABC= c(1,NA,1,1),
p.DEF= c(NA,1,NA,NA),
p.TAR= c(1,NA,1,1),
p.REP= c(NA,1,1,NA),
p.FAR= c(NA,NA,1,1),
name=c("ABC_TAR","DEF_REP","ABC_FAR_REP_TAR","ABC_FAR_TAR"))
I would like to emphasize that I'm really looking for a solution using dplyr, as I would be able to do it without it (but it doesn't look pretty and it's slow).

Here is an option with tidyverse, where we reshape the data into 'long' format with pivot_longer, group by row_number(), paste the values of the 'name' column after removing the prefix part, and then bind that column to the original data
library(dplyr)
library(stringr)
library(tidyr)
a %>%
mutate(rn = row_number()) %>%
select(-id, -x) %>%
pivot_longer(cols = -rn, values_drop_na = TRUE) %>%
group_by(rn) %>%
summarise(name = str_c(str_remove(name, ".*\\."), collapse="_"),
.groups = 'drop') %>%
select(-rn) %>%
bind_cols(a, .)
-output
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
Or use pmap
library(purrr)
a %>%
mutate(name = pmap_chr(select(cur_data(), contains('.')), ~ {
nm1 <- c(...)
str_c(str_remove(names(nm1)[!is.na(nm1)], '.*\\.'), collapse="_")}))
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
Or use apply in base R
apply(a[-(1:2)], 1, function(x) paste(sub(".*\\.", "",
names(x)[!is.na(x)]), collapse="_"))
#[1] "ABC_TAR" "DEF_REP" "ABC_TAR_REP_FAR" "ABC_TAR_FAR"
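The outputs above list the names in column order (for example ABC_TAR_REP_FAR), whereas the question asks for alphabetical order (ABC_FAR_REP_TAR). One possible way to get that, sketched here under the assumption that dplyr >= 1.1.0 is available for pick(), is to sort the column names first; p_cols and a_named are names introduced only for this illustration.

```r
library(dplyr)

a <- data.frame(id = c(rep("A", 2), rep("B", 2)),
                x = c(rep(2, 2), rep(3, 2)),
                p.ABC = c(1, NA, 1, 1),
                p.DEF = c(NA, 1, NA, NA),
                p.TAR = c(1, NA, 1, 1),
                p.REP = c(NA, 1, 1, NA),
                p.FAR = c(NA, NA, 1, 1))

# Column names starting with "p.", sorted alphabetically
p_cols <- sort(grep("^p\\.", names(a), value = TRUE))

a_named <- a %>%
  mutate(name = apply(
    !is.na(pick(all_of(p_cols))),   # logical matrix, columns in sorted order
    1,
    function(keep) paste(sub("^p\\.", "", p_cols[keep]), collapse = "_")
  ))

a_named$name
```

Because the columns are selected in sorted order, the names within each row come out sorted without any per-row sorting step.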

My answer may be similar to others, but it is written in tidyverse pipe style, which may be easier to follow. If anyone feels it duplicates their answer, I will be happy to delete it.
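library(dplyr)
library(purrr)
library(stringr)
# pmap_chr returns a character vector; fixed("p.") removes only the literal prefix
a %>% mutate(name = pmap_chr(select(cur_data(), starts_with('p.')),
~ names(c(...))[!is.na(c(...))] %>%
str_remove_all(., fixed("p.")) %>%
paste(., collapse = '_')
)
)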
id x p.ABC p.DEF p.TAR p.REP p.FAR name
1 A 2 1 NA 1 NA NA ABC_TAR
2 A 2 NA 1 NA 1 NA DEF_REP
3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
4 B 3 1 NA 1 NA 1 ABC_TAR_FAR
The idea is that pipes can be used directly inside the map/reduce family of functions, which avoids having to write a custom function beforehand or create intermediate objects inside {}.

Using rowwise :
library(dplyr)
cols <- grep('^p\\.', names(a), value = TRUE)
a %>%
rowwise() %>%
mutate(name = paste0(sub('p\\.', '',
cols[!is.na(c_across(starts_with('p')))]), collapse = '_')) %>%
ungroup
# id x p.ABC p.DEF p.TAR p.REP p.FAR name
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 A 2 1 NA 1 NA NA ABC_TAR
#2 A 2 NA 1 NA 1 NA DEF_REP
#3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
#4 B 3 1 NA 1 NA 1 ABC_TAR_FAR

Updated
Special thanks to @akrun for helping me improve my code:
We made a small modification to suppress a message produced by unnest_wider.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
a %>%
mutate(name = pmap(select(a, starts_with("p.")), ~ {nm1 <- names(c(...))[!is.na(c(...))];
setNames(nm1, seq_along(nm1))})) %>%
unnest_wider(name) %>%
rowwise() %>%
mutate(across(8:11, ~ str_remove(., fixed("p.")))) %>%
unite(NAME, c(8:11), sep = "_", na.rm = TRUE)
# A tibble: 4 x 8
id x p.ABC p.DEF p.TAR p.REP p.FAR NAME
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 A 2 1 NA 1 NA NA ABC_TAR
2 A 2 NA 1 NA 1 NA DEF_REP
3 B 3 1 NA 1 1 1 ABC_TAR_REP_FAR
4 B 3 1 NA 1 NA 1 ABC_TAR_FAR

How to summarise across different types of variables with dplyr::c_across()

I have data with different types of variables. Some are character, some factors, and some numeric, like below:
df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))
I'm trying to count the number of missing values per observation using c_across in dplyr
However, c_across doesn't seem to be able to combine different type of values, as the error message below suggests
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across())))
Error: Problem with summarise() input NAs.
x Can't combine `a` <factor> and `b` <double>.
ℹ Input NAs is sum(is.na(c_across())).
ℹ The error occurred in row 1.
Indeed, if I include only numeric variables, it works.
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(b:c))))
Same thing if I include only character variables
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(c(a,d)))))
I could solve the issue without using c_across like below, but I have lots of variables, so it's not very practical.
df %>%
rowwise() %>%
summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))
I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.
apply(df, 1, function(x)sum(is.na(x)))
Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?

I would suggest this approach. The issue comes from two things: first, the variables in your dataframe have different types, and second, you need a key variable for the rowwise-style task. So, in the code below we first convert the variables to a common type, then we create an id based on the row number. We pass that id to rowwise() and can then use the c_across() function. Here is the code (I have used your df data):
library(tidyverse)
#Code
df %>%
mutate_at(vars(everything()), as.character) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
We can also avoid the mutate_at() function by using the newer across() with mutate() to convert the variables to a common type:
#Code 2
df %>%
mutate(across(a:d,~as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3

A much faster option is to skip rowwise and c_across altogether and use rowSums instead
library(dplyr)
df %>%
mutate(NAs = rowSums(is.na(.)))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 3
If we want to count only within certain columns, e.g. the numeric ones
df %>%
mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 1
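The same rowSums idea can be written without the . placeholder by selecting columns inside mutate(). This sketch assumes dplyr >= 1.1.0, where pick() is available; the column names NAs_all and NAs_num are chosen here for illustration.

```r
library(dplyr)

df <- data.frame(a = c("tt", "ss", "ss", NA),
                 b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA),
                 d = c("tt", "ss", "ss", NA))

res <- df %>%
  mutate(
    NAs_all = rowSums(is.na(pick(a:d))),  # missing values across all four columns
    NAs_num = rowSums(is.na(pick(b:c)))   # missing values in the numeric columns only
  )

res
```

is.na() on the selected columns returns a logical matrix, so mixed column types are not a problem and no rowwise() step is needed.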

Reorder a single column in a dataframe within each level of another column

Probably the solution to this problem is really easy but I just can't see it. Here is my sample data frame:
df <- data.frame(id=c(1,1,1,2,2,2), value=rep(1:3,2), level=rep(letters[1:3],2))
df[6,2] <- NA
And here is the desired output that I would like to create:
df$new_value <- c(3,2,1,NA,2,1)
So the order of all columns is the same, and for the new_value column the value column order is reversed within each level of the id column. Any ideas? Thanks!

As I understand your question, it is a coincidence that your data is sorted. If you just want to reverse the order within each id without sorting:
library(dplyr)
df %>% group_by(id) %>% mutate(new_value = rev(value)) %>% ungroup
# A tibble: 6 x 4
id value level new_value
<dbl> <int> <fctr> <int>
1 1 1 a 3
2 1 2 b 2
3 1 3 c 1
4 2 1 a NA
5 2 2 b 2
6 2 NA c 1

A slightly different approach, using the parameters in the sort function:
library(dplyr)
df %>% group_by(id) %>%
mutate(value = sort(value, decreasing=TRUE, na.last=FALSE))
Output:
# A tibble: 6 x 3
# Groups: id [2]
id value level
<dbl> <int> <fctr>
1 1.00 3 a
2 1.00 2 b
3 1.00 1 c
4 2.00 NA a
5 2.00 2 b
6 2.00 1 c
Hope this helps!

We can use order on the missing values and on the column itself
library(dplyr)
df %>%
group_by(id) %>%
mutate(new_value = value[order(!is.na(value), -value)])
# A tibble: 6 x 4
# Groups: id [2]
# id value level new_value
# <dbl> <int> <fctr> <int>
#1 1.00 1 a 3
#2 1.00 2 b 2
#3 1.00 3 c 1
#4 2.00 1 a NA
#5 2.00 2 b 2
#6 2.00 NA c 1
Or using the arrange from dplyr
df %>%
arrange(id, !is.na(value), desc(value)) %>%
transmute(new_value = value) %>%
bind_cols(df, .)
Or, using base R, specify na.last = FALSE in order()
with(df, ave(value, id, FUN = function(x) x[order(-x, na.last = FALSE)]))
#[1] 3 2 1 NA 2 1
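The rev() answer and the sort()/order() answers agree on the sample data only because value is already in ascending order within each id. A small sketch with made-up, unsorted data (df2 is hypothetical) shows where they diverge:

```r
library(dplyr)

# Hypothetical data: value is NOT sorted within the id
df2 <- data.frame(id = c(1, 1, 1), value = c(2, 3, 1))

out <- df2 %>%
  group_by(id) %>%
  mutate(reversed    = rev(value),                                   # flips row order
         sorted_desc = sort(value, decreasing = TRUE, na.last = FALSE)) %>%  # orders by size
  ungroup()

out
```

Which one is right depends on whether "reversed" means reversing the existing row order or ranking the values from largest to smallest.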
