I am trying to keep all rows in a summarise() output even when one of the summarised values does not exist. I have a data frame that looks like this:
dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9))
I then need to summarize by id, so that each id is a row and there is a column for the first seq_num's time and one for the second. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the answers to this related question, but they are not working.
dat %>%
  group_by(id, .drop = FALSE) %>%
  summarise(seq_0_time = time[seq_num == 0],
            seq_1_time = time[seq_num == 1])
outputs
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.
How can I do this?
If there is at most one observation per 'seq_num' for each 'id', then it is possible to coerce the missing cases to NA by subsetting with [1]:
library(dplyr)
dat %>%
  group_by(id) %>%
  summarise(seq_0_time = time[seq_num == 0][1],
            seq_1_time = time[seq_num == 1][1],
            .groups = 'drop')
-output
# A tibble: 3 × 3
id seq_0_time seq_1_time
<dbl> <dbl> <dbl>
1 1 4 5
2 2 6 7
3 3 9 NA
The point is that indexing a length-0 vector at position 1 pads it to length 1 with NA. Similarly, indexing with [1:2], [1:3], etc. replicates NA to fill positions that didn't occur:
> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA
Or using length<-
> `length<-`(numeric(0), 3)
[1] NA NA NA
This can actually be pretty easily solved using reshape.
> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
id time.0 time.1 time.2
1 1 4 5 NA
3 2 6 7 8
6 3 9 NA NA
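For reference, a tidyr sketch of the same wide reshape (my own addition, assuming the dat defined above; missing cells are filled with NA):
library(tidyr)
pivot_wider(dat, id_cols = id, names_from = seq_num,
            values_from = time, names_prefix = "time.")
#> # A tibble: 3 × 4
#>      id time.0 time.1 time.2
#>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1      4      5     NA
#> 2     2      6      7      8
#> 3     3      9     NA     NA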
My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:
library(tidyverse)
dat <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                  seq_num = c(0:1, 0:2, 0:0),
                  time = c(4, 5, 6, 7, 8, 9)) %>%
  complete(seq_num = seq_num,
           id = id)
dat %>%
  group_by(id, .drop = FALSE) %>%
  summarise(seq_0_time = time[seq_num == 0],
            seq_1_time = time[seq_num == 1])
#> # A tibble: 3 x 3
#> id seq_0_time seq_1_time
#> <dbl> <dbl> <dbl>
#> 1 1 4 5
#> 2 2 6 7
#> 3 3 9 NA
Created on 2022-04-20 by the reprex package (v2.0.1)
Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole dataframe into one row so that it contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the dataframe is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If there can be multiple non-NA values per group, max may not be what you want; you can use sum instead.
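For example, a minimal sketch of the sum() variant (my own illustration; it assumes at most one non-NA value per id and column, as in the example data, otherwise the values get added together):
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
#      id     A     B     C
# 1     1     3     5     2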
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, please check the below code (fill() comes from tidyr; dplyr is assumed to be loaded):
library(tidyr)
data.frame(id, A, B, C) %>% group_by(id) %>%
  fill(c(A, B, C), .direction = 'downup') %>% slice_head(n = 1)
Created on 2023-02-03 with reprex v2.0.2
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2
I'm having trouble excluding missing values in the summarise_all() function.
I have a dataset (df) as shown below and basically I'm having two problems:
excluding missing values, so that the output is only one number per cell
additional data rows with the same IDs but NA values (the second row for each ID, with 'TRUE' values, in the df1 output)
The df1 dataset is the one I'm trying to get to.
Here's the whole enchilada:
df #the original dataset
ID type of data genes1 genes2 genes3 ...
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1
...
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code
#output
ID type of data genes1 genes2 genes3 ...
1 c("new","old","suggested") c(2,NA,NA) c(0,NA,NA) c(2,NA,NA)
1 TRUE TRUE TRUE TRUE
2 c("new","old","suggested") c(1,NA,NA) c(1,NA,NA) c(1,NA,NA)
2 TRUE TRUE TRUE TRUE
...
#my main concern is the "genes" type of data and the rows with same IDs and NA values, I wanted something like this
df1 #dream dataset
ID type of data genes1 genes2 genes3 ...
1 #doesn't matter 2 0 2
2 #doesn't matter 1 1 1
...
I also tried using na.omit in summarise_all but it didn't really fix anything.
Does anybody have any ideas on how to fix it?
You could do:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(across(starts_with('genes'), ~ .[!is.na(.)]))
#> # A tibble: 2 × 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
Another way
library(dplyr)
library(tidyr)   # fill() comes from tidyr
df[-2] |>
  group_by(ID) |>
  fill(genes1:genes3, .direction = "downup") |>
  slice(1)
ID genes1 genes2 genes3
<int> <int> <int> <int>
1 1 2 0 2
2 2 1 1 1
An alternative approach based on the coalesce() function from dplyr
In the code below, we remove the type variable since the OP indicated we don't need it in the output. We then group_by() ID so that each gene column is handled separately within each ID. The coalesce_by_column() function we define takes one column at a time, converts it into a list whose elements are the column's individual values, and passes that list to coalesce().
coalesce() takes a set of vectors and finds the first non-NA value across the vectors at each index. In practice, this means it can take several inputs with only one or zero non-NA values at each index and collapse them into a single vector with as many non-NA values as possible.
Usually we would have to pass each vector as its own argument to coalesce(), but we can use the [splice operator](https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r) !!! to pass each element of our list as its own argument. See the last example in ?"!!!" for a demonstration.
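As a quick toy illustration of both pieces (my own example, separate from the answer's data):
library(dplyr)
# coalesce() returns, for each position, the first non-NA value across its inputs
coalesce(c(3, NA, NA), c(NA, 5, NA), c(NA, NA, 2))
#> [1] 3 5 2
# !!! splices a list so that each element is passed as a separate argument
coalesce(!!!list(c(3, NA, NA), c(NA, 5, NA), c(NA, NA, 2)))
#> [1] 3 5 2
Applying the same idea to each gene column within each ID: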
library(dplyr)
library(tidyr)
# Define a function that coalesces the values within a single column:
# split the vector into its individual elements and keep the first non-NA one
coalesce_by_column <- function(df) {
  coalesce(!!!as.list(df))
}

# Drop the type column, then collapse each gene column within each ID
df %>%
  select(-type) %>%
  group_by(ID) %>%
  summarise(across(everything(), coalesce_by_column))
#> # A tibble: 2 x 4
#> ID genes1 genes2 genes3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 2
#> 2 2 1 1 1
If you are not worried about the type column, you can do something like this:
library(tidyverse)
" ID type genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1" %>%
read_table() -> df
df %>%
  pivot_longer(-c(ID, type)) %>%
  drop_na(value) %>%
  select(-type) %>%
  pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
ID genes1 genes2 genes3
<dbl> <dbl> <dbl> <dbl>
1 1 2 0 2
2 2 1 1 1
If you want to keep the "type of data" column while using summarise, you can use the following code:
df <- read.table(text = "ID type_of_data genes1 genes2 genes3
1 new 2 NA NA
1 old NA 0 NA
1 suggested NA NA 2
2 new 1 NA NA
2 old NA 1 NA
2 suggested NA NA 1", header = TRUE)
library(dplyr)
library(tidyr)
df1 <- df %>%
  group_by(ID) %>%
  summarise(across(starts_with("genes"), na.omit),
            type_of_data = type_of_data[genes1]) %>%
  ungroup()
df1
#> # A tibble: 2 × 5
#> ID genes1 genes2 genes3 type_of_data
#> <int> <int> <int> <int> <chr>
#> 1 1 2 0 2 old
#> 2 2 1 1 1 new
Created on 2022-07-26 by the reprex package (v2.0.1)
I have a data.frame with a group variable and an integer variable, with missing data.
df<-data.frame(group=c(1,1,2,2,3,3),a=as.integer(c(1,2,NA,NA,1,NA)))
I want to compute the maximum available value of variable a within each group : in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
df %>%
  group_by(group) %>%
  mutate(max.a = case_when(sum(!is.na(a)) == 0 ~ NA_integer_,
                           T ~ max(a, na.rm = T)))
The above code generates an error, seemingly because in group 2 all values of a are missing so max(a,na.rm=T) is set to -Inf, which is not an integer.
Why is the max() branch evaluated for group 2 even though the first condition is met there, so that branch is never selected, as the following verification confirms?
df %>% group_by(group) %>% mutate(test=sum(!is.na(a))==0)
I found a workaround by converting a to double, but I still get a warning and am dissatisfied not to have found a better solution.
case_when() evaluates all the RHS expressions irrespective of whether their conditions are satisfied, hence you get an error. You may use hablar::max_(), which returns NA if all the values are NA.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(max.a = hablar::max_(a)) %>%
  ungroup()
# group a max.a
# <dbl> <int> <int>
#1 1 1 2
#2 1 2 2
#3 2 NA NA
#4 2 NA NA
#5 3 1 1
#6 3 NA 1
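To see the eager evaluation in isolation, here is a toy call (my own example, not tied to the OP's data; the exact error text depends on the dplyr version). The second RHS is evaluated even though its condition can never be the one selected:
library(dplyr)
case_when(TRUE ~ 1, FALSE ~ stop("evaluated anyway"))
#> Error: evaluated anyway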
Instead of making use of case_when(), I would suggest using an if() statement like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
  group_by(group) %>%
  mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = T))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group a max.a
#> <dbl> <int> <dbl>
#> 1 1 1 2
#> 2 1 2 2
#> 3 2 NA NA
#> 4 2 NA NA
#> 5 3 1 1
#> 6 3 NA 1
This code gives a warning, and for group 2 (where all values are missing) it returns -Inf rather than NA, but otherwise it works.
library(dplyr)
df %>%
  group_by(group) %>%
  dplyr::summarise(max.a = max(a, na.rm = TRUE))
Output:
group max.a
<dbl> <dbl>
1 1 2
2 2 -Inf
3 3 1
I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1, 2, 3, NA, 5, NA, 1, 2, NA, 4, 5, 6),
                       var1 = c(1, 1, 1, NA, 1, NA, 1, 1, NA, 1, 1, 1))
Note that B represents a dataframe with multiple variables, and the NAs in Expected are rows of NAs in the dataframe. Also, the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
  group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
  complete(Signal = 1:6) %>%
  ungroup() %>%
  select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If in the output you need Signal to be NA for the missing combinations, you can use:
A %>%
  group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
  complete(Signal = 1:6) %>%
  mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
  ungroup() %>%
  select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as_tibble(data.frame(id = rep(1:3, each = 2),
                           item_code = c("A", "A", "B", "B", "B", "Z"),
                           score = rep(1, 6)))
additional_rows <- as_tibble(data.frame(item_code = c("A", "Z"),
                                        score = c(6, 6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>%
  group_by(id) %>%
  do(add_row(additional_rows %>% filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)    # for join(); load plyr before dplyr to avoid masking issues
library(dplyr)   # the %>% pipe and arrange() are used below

df %>%
  join(subset(df, item_code %in% additional_rows$item_code,
              select = c(id, item_code)) %>%
         join(additional_rows) %>%
         subset(!duplicated(.)),
       type = "full") %>%
  arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: to get the scores in the same order, I added the other arrange() terms.
Edit 2: there should now be no duplicated rows added from the additional rows, as per your comment.
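For comparison, a join-based sketch using only dplyr (my own take, assuming the df and additional_rows defined in the question) that produces the same eight rows:
library(dplyr)
df %>%
  distinct(id, item_code) %>%                          # one row per id/item_code pair
  inner_join(additional_rows, by = "item_code") %>%    # keep only the codes that have an extra row
  bind_rows(df) %>%                                    # append the original rows
  arrange(id, item_code, desc(score))                  # extra rows (score 6) sort to the top of each group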