Last element in group omitting NA in R

I am looking for a way to get the last element in a group, omitting NA. The standard dplyr solution is not working, and it is not clear when the issue is going to be fixed.
Can anybody suggest a workaround?
Here is an example of what I am looking for:
df <- data.frame(col_1 = c('A', 'A', 'B', 'B'), col_2 = c(1, NA, 3, 3))
So I would like to group by col_1 and return 1 for group A and 3 for group B.

One way to do it is to use na.omit and tail:
df %>%
  group_by(col_1) %>%
  summarise(last = tail(na.omit(col_2), 1))
   col_1  last
  <fctr> <dbl>
1 A          1
2 B          3
Or you could filter your dataframe, then slice the last row per group:
df %>% filter(!is.na(col_2)) %>% group_by(col_1) %>% slice(n())
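As an aside, on dplyr 1.1.0 or later, first(), last(), and nth() gained an na_rm argument, which makes this direct (a sketch assuming a recent dplyr version):
library(dplyr)
df %>%
  group_by(col_1) %>%
  summarise(last = last(col_2, na_rm = TRUE))  # na_rm requires dplyr >= 1.1.0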

After grouping by 'col_1', arrange on the logical vector is.na(col_2) (FALSE sorts before TRUE, so non-NA rows come first) and slice the first element:
library(dplyr)
df %>%
  group_by(col_1) %>%
  arrange(is.na(col_2)) %>%
  slice(1)
# A tibble: 2 x 2
# Groups: col_1 [2]
# col_1 col_2
# <fctr> <dbl>
#1 A 1
#2 B 3


Select second largest row by group in R

I have this problem
library(dplyr)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
For each id, I want to keep only the row with the second largest value of var1. I tried different ways (dplyr, base R):
problem %>%
  group_by(id) %>%
  slice_tail(2, -var1)
problem[with(problem, ave(var1, id, FUN = function(x) x == tail(sort(x), 2)[1])), ]
The first code doesn't work; the second gives the wrong answer.
What am I doing wrong?
problem %>% group_by(id) %>% arrange(var1) %>% slice(n() - 1)
n() counts the number of rows in each group, so slice(n() - 1) takes the (n-1)-th element. Note this will cause issues for groups with fewer than 2 members; you may wish to allow for that (see the sketch below).
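A minimal sketch of one guard, which simply drops one-row groups (they have no second-largest value); returning an NA row instead would also be possible:
problem %>%
  group_by(id) %>%
  arrange(var1, .by_group = TRUE) %>%
  filter(n() >= 2) %>%   # one-row groups have no second-largest row
  slice(n() - 1)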
If you wish to use slice, I guess you can first slice_max() the largest two rows, then slice_tail() to remove the largest row.
library(dplyr)
problem %>%
  group_by(id) %>%
  slice_max(var1, n = 2) %>%
  slice_tail(n = 1)
Or you can use a single filter:
problem %>% group_by(id) %>% filter(var1 == max(var1[var1 != max(var1)]))
Output
# A tibble: 2 × 3
# Groups:   id [2]
     id  var1  var2
  <dbl> <dbl> <dbl>
1     1     4    12
2     2     5     9
In case you have a large data volume, here is a data.table approach.
library(data.table)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3,6,5,4), var2 = c(99,12,32,88,9,8))
setDT(problem)
setorder(problem, id, -var1)
problem[, .SD[2], by = id]
As per Paul Stafford Allen's comment, you will have an issue with groups of size 1.
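As a sketch of one way around that, .I[2] is NA for one-row groups, and subsetting a data.table with an NA index yields an all-NA row instead of an error (assuming the same sorted problem table as above):
problem[problem[, .I[2], by = id]$V1]  # .I[2] is the row index of each group's second row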
After arranging 'var1' in descending order, use slice with 2:
library(dplyr)
problem %>%
  arrange(id, desc(var1)) %>%
  group_by(id) %>%
  slice(2) %>%
  ungroup
Output
# A tibble: 2 × 3
     id  var1  var2
  <dbl> <dbl> <dbl>
1     1     4    12
2     2     5     9

remove group if any member contains NA in R

How can I remove an entire group if one of its values is NA? For example, remove category B because it contains NA.
library(dplyr)
tbl = tibble(category = c("A", "A", "B", "B"),
             values = c(2, 3, 1, NA))
We can use filter after grouping by 'category'
library(dplyr)
tbl %>%
  group_by(category) %>%
  filter(!any(is.na(values))) %>%
  ungroup
Output
# A tibble: 2 x 2
  category values
  <chr>     <dbl>
1 A             2
2 A             3
tbl %>%
  filter(!category %in% category[is.na(values)])
Output
category values
<chr> <dbl>
1 A 2
2 A 3
tbl %>%
  group_by(category) %>%
  filter(all(!is.na(values)))
category values
<chr> <dbl>
1 A 2
2 A 3
You can get the categories which have at least one NA value and exclude them.
subset(tbl, !category %in% unique(category[is.na(values)]))
# category values
# <chr> <dbl>
#1 A 2
#2 A 3
If you prefer dplyr::filter:
library(dplyr)
tbl %>% filter(!category %in% unique(category[is.na(values)]))
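For completeness, an equivalent data.table sketch (assuming the same tbl): the if (...) .SD idiom returns a group's rows only when the condition holds, so groups containing an NA are dropped.
library(data.table)
setDT(tbl)
tbl[, if (!anyNA(values)) .SD, by = category]  # keep a group only if it has no NA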

How to use slice in dplyr to keep the rows with NA values in R

I have the following dataset, and I want to know the minimum word for each group; if there is no minimum word (it is NA), I still want to display it:
df = data.frame(
  key = c("A", "A", "B", "B", "C"),
  word = c(1, 2, 3, 5, NA))
df %>% group_by(key) %>% slice(which.min(word))
This excludes the key = C, word = NA row, which I want to keep:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))
We can create a logical condition with is.na in filter to return the NA rows as well, after grouping by 'key':
library(dplyr)
df %>%
  group_by(key) %>%
  filter(word == min(word) | is.na(word))
Or using slice; we don't need any if/else condition:
df %>%
  group_by(key) %>%
  slice(which(word == min(word) | is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly:
df %>%
  group_by(key) %>%
  slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match, whereas which.min removes the NA:
which.min(c(NA, 1, 3))
#[1] 2
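match, by contrast, does locate an NA, and for the all-NA group C min(word) is NA, so slice(match(min(word), word)) still returns a row for that group:
match(NA, c(NA, 1, 3))
#[1] 1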
We can check the condition with if: if all the words in a group are NA, we return the first row; otherwise we return the row with the minimum.
library(dplyr)
df %>%
  group_by(key) %>%
  slice(if (all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the first row in each group (arrange places NA last within each key):
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
You can create a modified slice-function using the tidyverse package, which returns NA rows:
library(tidyverse)
slice_uneven = function(.data, .idx) {
  .data_ = .data %>% add_row()                          # add an extra all-NA row
  .idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_))   # replace NA with the index of the extra row
  .data_[.idx_, ] %>% head(-1) %>% remove_rownames()    # subset, remove the extra row, and reset rownames
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output (arrange sorts NA to the end, so distinct keeps the non-NA minimum when one exists):
library(dplyr)
df %>%
  arrange(word) %>%
  distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA

using mutate with row and column indexing and group_by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(group = c('a','a','b','b'),
           time = c(1,2,1,2),
           value = seq(1,4,1))
I want to create a variable 'baseline' that takes the content of the variable 'value' where time = 1, by group. As such, the desired output would be:
data.frame(group = c('a','a','b','b'),
           time = c(1,2,1,2),
           value = seq(1,4,1),
           baseline = c(1,1,3,3))
I tried to run the following code with indexing but am clearly going wrong somewhere:
x <- data.frame(group = c('a','a','b','b'),
                time = c(1,2,1,2),
                value = seq(1,4,1))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min to pick the 'value' at the earliest 'time':
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
And if it is already ordered by 'time', then simply use first:
df1 %>%
  group_by(group) %>%
  mutate(baseline = first(value))
data
df1 <- data.frame(group = c('a','a','b','b'),
                  time = c(1,2,1,2),
                  value = seq(1,4,1))
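As a side note, on dplyr 1.1.0 or later the same can be written without group_by() via the .by argument (a sketch assuming a recent dplyr version):
df1 %>%
  mutate(baseline = value[which.min(time)], .by = group)  # .by requires dplyr >= 1.1.0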

filter() but keep groups without value

I am trying to condense a grouped df, pulling out only rows that contain a certain value, but that value isn't present in all groups. I want to find a way to pull out all rows with that value, but also create an NA or 0 row for each group not containing it.
Ex:
x1 <- c('1','1','1','1','1','2','2','2','2','2','3','3','3','3','3')
x2 <- c('a','b','c','d','e','b','c','d','e','f','a','b','d','e','f')
df <- data.frame(x1,x2)
df %>%
  group_by(x1) %>%
  filter(x2 == "a")
this returns:
x1 x2
<fct> <fct>
1 1 a
2 3 a
but I want it to return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
Obviously the real code is much more complicated, so I'm looking for the best way to keep these empty groups in a reproducible way.
PS: I would like to stay in dplyr to keep things smooth in a function chain.
Thanks!
One dplyr option could be:
df %>%
  group_by(x1) %>%
  slice(which.max(x2 == "a")) %>%
  mutate(x2 = replace(x2, x2 != "a", NA_character_))
x1 x2
<fct> <fct>
1 1 a
2 2 <NA>
3 3 a
If it's relevant to have multiple target values per group:
df %>%
  group_by(x1) %>%
  filter(x2 == "a") %>%
  bind_rows(df %>%
              group_by(x1) %>%
              filter(all(x2 != "a")) %>%
              slice(1) %>%
              mutate(x2 = replace(x2, x2 != "a", NA_character_)))
As you did not specify dplyr solutions only, here's one option with data.table:
library(data.table)
setDT(df)
df[, .(x2 = x2[match('a', x2)]), x1]
# x1 x2
# 1: 1 a
# 2: 2 <NA>
# 3: 3 a
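The key here is that match() returns NA when no match is found, which is what produces the <NA> row for group 2:
match("a", c("b", "c"))
#[1] NA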
This happens because of the way dplyr was written.
According to Hadley Wickham (the package creator), to keep NA values you should ask for them explicitly. As he said in this issue on GitHub, you should filter(a == x | is.na(a)). In your case you can use the following:
df %>%
  group_by(x1) %>%
  filter(x2 == "a" | is.na(x2))
That will return this result:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
In this code you're asking R for all rows in which x2 equals "a" and also those in which x2 is NA.
We can use complete after the filter step to get the missing combinations. By default, all the other columns will be filled with NA (this can be set to a custom value with the fill argument):
library(dplyr)
library(tidyr)
df %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1))
# A tibble: 3 x 2
# x1 x2
# <fct> <fct>
#1 1 a
#2 2 <NA>
#3 3 a
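As a sketch of the fill argument mentioned above (converting x2 to character first, since fill cannot add a new level to a factor; the placeholder 'none' is just for illustration):
df %>%
  mutate(x2 = as.character(x2)) %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1), fill = list(x2 = 'none'))  # missing groups get 'none' instead of NA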
Another option is match:
df %>%
  group_by(x1) %>%
  summarise(x2 = x2[match('a', x2)])
If there are many columns, then mutate 'x2' with match and then slice the first row:
df %>%
  group_by(x1) %>%
  mutate(x2 = x2[match('a', x2)]) %>%
  slice(1)
How about a base R solution using aggregate(), like below?
dfout <- aggregate(x2 ~ x1, df, function(v) ifelse("a" %in% v, "a", NA))
or
dfout <- aggregate(x2 ~ x1, df, function(v) v[match("a", v)])
such that
> dfout
x1 x2
1 1 a
2 2 <NA>
3 3 a
