Value based on largest value by neighbouring column - r

Using group_by(), I want to get, for each group, the value of column value2 at the row with the largest value of column value:
df = data.frame(id = c(1,1,1,1,2,2,2,2),
value = c(4,5,1,3,1,2,3,1),
value2 = c("a","b","c","d","e","f","g","h"))
df %>% group_by(id) %>%
summarise(value2_of_largest_value = f(value, value2))
Desired output (f is a placeholder for the function I am looking for):
1 b
2 g

We can use which.max to get the index of the largest value and use that index to subset value2:
library(dplyr)
f1 <- function(x, y) y[which.max(x)]
df %>%
group_by(id) %>%
summarise(value2 = f1(value, value2))
#or simply
# summarise(value2 = value2[which.max(value)])
# A tibble: 2 x 2
# id value2
# <dbl> <fct>
#1 1 b
#2 2 g

Another approach in dplyr:
library(dplyr)
df %>%
group_by(id) %>%
filter(value == max(value))
Or in data.table:
library(data.table)
setDT(df)[setDT(df)[, .I[value == max(value)], by=id]$V1]
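The two styles above differ when a group has tied maxima: which.max() returns only the first maximal position, while filter(value == max(value)) keeps every tied row. A minimal sketch of that difference, assuming a made-up data frame ties_df (not from the question) with a tie in group 1:

```r
library(dplyr)

# Hypothetical data: group 1 has two rows tied at the maximum value 5
ties_df <- data.frame(id = c(1, 1, 1, 2, 2),
                      value = c(5, 5, 1, 2, 3),
                      value2 = c("a", "b", "c", "d", "e"))

# which.max() keeps only the first maximum per group
first_max <- ties_df %>%
  group_by(id) %>%
  summarise(value2 = value2[which.max(value)])

# filter() keeps every row that equals the group maximum
all_max <- ties_df %>%
  group_by(id) %>%
  filter(value == max(value)) %>%
  ungroup()
```

Which behaviour is preferable depends on whether one row per group or all tied rows are wanted.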

Related

Filter all group if one element satisfy a condition in dplyr

I have the following data:
df1 <- data.frame( id = c(1,1,2) , a = c('a','b','c'))
> df1
id a
1 1 a
2 1 b
3 2 c
I would like to remove all elements of a group, defined by a variable, if at least one element of the group satisfies a given condition.
Here, that means dropping all rows of an id if variable a equals 'a' in any of them.
I tried the following, which removes only the row with a == 'a', whereas I want the whole group with id == 1 to be removed:
> df1 %>%
+ group_by( id ) %>%
+ filter( a != 'a')
# A tibble: 2 × 2
# Groups: id [2]
id a
<dbl> <chr>
1 1 b
2 2 c
Any help would be welcome
Perhaps use all
library(dplyr)
df1 %>%
group_by(id) %>%
filter(all(a != 'a')) %>%
ungroup
Or with any
df1 %>%
group_by(id) %>%
filter(!any(a == 'a')) %>%
ungroup
Or may use %in% as well
df1 %>%
group_by(id) %>%
filter(!'a' %in% a) %>%
ungroup
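For comparison, the same group-level filter can be sketched in base R: first collect the ids that contain the unwanted value, then drop every row with one of those ids. The names bad_ids and kept are introduced here for illustration only.

```r
# Same data as in the question
df1 <- data.frame(id = c(1, 1, 2), a = c("a", "b", "c"))

# ids of groups that contain at least one 'a'
bad_ids <- unique(df1$id[df1$a == "a"])

# keep only the rows whose id is not among them
kept <- df1[!df1$id %in% bad_ids, ]
```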

Dplyr pipe groupby top_n does not get top_n in group

I'm trying to obtain the top 2 names, sorted alphabetically, per group. I would think that top_n() would select this after I perform a group_by. However, this does not seem to be the case. This code shows the problem.
df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
Name = c("a", "c", "b", "e", "d", "f"))
df <- df %>%
arrange(Name, Group) %>%
group_by(Group) %>%
top_n(2)
df
# A tibble: 2 x 2
# Groups: Group [1]
Group Name
<dbl> <chr>
1 1 e
2 1 f
Expected output would be:
Group Name
1 0 a
2 0 b
3 1 d
4 1 e
Or something similar. Thanks.
top_n selects the n rows with the largest values. You seem to need the n smallest values, which you can get by passing a negative n. Additionally, you don't need to arrange the data when using top_n.
library(dplyr)
df %>% group_by(Group) %>% top_n(-2, Name)
# Group Name
# <dbl> <chr>
#1 0 a
#2 0 b
#3 1 e
#4 1 d
Another way is to arrange the data and select the first two rows in each group.
df %>% arrange(Group, Name) %>% group_by(Group) %>% slice(1:2)
We can use
library(dplyr)
df %>%
arrange(Group, Name) %>%
group_by(Group) %>%
filter(row_number() < 3)
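In dplyr 1.0.0 and later, top_n() is superseded by slice_min() and slice_max(), which name the direction explicitly. A minimal sketch on the question's data, assuming dplyr >= 1.0.0 is installed; top2 is a name introduced here:

```r
library(dplyr)

df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
                 Name = c("a", "c", "b", "e", "d", "f"))

# Two alphabetically smallest names per group;
# with_ties = FALSE guarantees at most two rows per group
top2 <- df %>%
  group_by(Group) %>%
  slice_min(Name, n = 2, with_ties = FALSE) %>%
  ungroup()
```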

Filter data by group & preserve empty groups

I wonder how can I filter my data by group, and preserve the groups that are empty?
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true. The filter function simply drops the groups that end up empty.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr functions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
group_by(site) %>%
filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
group_by(site) %>%
filter(value > 5) %>%
ungroup() %>%
complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
group_by(site) %>%
filter(value > 5) %>%
bind_rows(df %>%
group_by(site) %>%
filter(all(value <= 5)) %>%
summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map
df %>%
group_by(site) %>%
tidyr::nest() %>%
mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
tidyr::unnest(cols=c(data), keep_empty = TRUE)
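Another dplyr-only possibility, sketched here as an assumption rather than taken from the answers above, is to start from the distinct group keys and left-join the filtered rows back onto them. Groups without any matching row then appear once, with NA in the remaining columns. The name res is introduced for illustration.

```r
library(dplyr)

df <- data.frame(year = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 site = rep(c("a", "b", "d"), each = 3),
                 value = c(3, 3, 0, 1, 8, 5, 10, 18, 27))

# One row per site, then attach the rows that pass the filter;
# sites with no surviving rows get NA for year and value
res <- df %>%
  distinct(site) %>%
  left_join(filter(df, value > 5), by = "site")
```

Unlike the grouped filter, this returns an ungrouped data frame, so add group_by(site) afterwards if grouping is still needed.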

Count non-NA values by group

Here is my example
mydf<-data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100,NA, 90,30))
I would like to group by col_1 and count non-NA elements in col_2
I would like to do it with dplyr. Here is what I tried:
mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))
Nothing worked. Any suggestions?
You can use this
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
col_1 non_na_count
<fctr> <int>
1 A 1
2 B 2
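The attempts in the question fail for different reasons: length() has no na.rm argument, and count() expects a data frame rather than a column vector. Summing the logical vector !is.na(x) works because each TRUE counts as 1. A small sketch wrapping that idea in a helper, count_non_na, a name introduced here for illustration:

```r
library(dplyr)

mydf <- data.frame(col_1 = c("A", "A", "B", "B"),
                   col_2 = c(100, NA, 90, 30))

# Hypothetical helper: number of non-missing entries in a vector
count_non_na <- function(x) sum(!is.na(x))

res <- mydf %>%
  group_by(col_1) %>%
  summarise(non_na_count = count_non_na(col_2))
```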
We can filter the NA elements in 'col_2' and then do a count of 'col_1'
mydf %>%
filter(!is.na(col_2)) %>%
count(col_1)
# A tibble: 2 x 2
# col_1 n
# <fctr> <int>
#1 A 1
#2 B 2
or using data.table
library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]
Or with aggregate from base R
aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
# col_1 col_2
#1 A 1
#2 B 2
Or using table
table(mydf$col_1[!is.na(mydf$col_2)])
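Another option counts the NA values (rather than the non-NA values) per group, in every column that contains at least one NA; use !is.na(.) instead to count the non-NA values:
library(knitr)
library(dplyr)
mydf <- data.frame("col_1" = c("A", "A", "B", "B"),
"col_2" = c(100, NA, 90, 30))
mydf %>%
group_by(col_1) %>%
select_if(function(x) any(is.na(x))) %>%
# funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
summarise_all(~ sum(is.na(.))) -> NA_mydf
kable(NA_mydf)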

How to pass a variable name to dplyr's group_by()

I can calculate the rank of the values (val) in my dataframe df within the group name1 with the code:
res <- df %>% arrange(val) %>% group_by(name1) %>% mutate(RANK=row_number())
Instead of writing the column name name1 in the code, I want to pass it as a variable, e.g. crit1 = "name1". However, the code below does not work, since crit1 is taken to be a column name rather than a variable holding one.
res <- df %>% arrange(val) %>% group_by(crit1) %>% mutate(RANK=row_number())
How can I pass crit1 in the code?
Thanks.
We can use group_by_
library(dplyr)
df %>%
arrange(val) %>%
group_by_(.dots=crit1) %>%
mutate(RANK=row_number())
#Source: local data frame [10 x 4]
#Groups: name1, name2 [7]
# val name1 name2 RANK
# <dbl> <chr> <chr> <int>
#1 -0.848370044 b c 1
#2 -0.583627199 a a 1
#3 -0.545880758 a a 2
#4 -0.466495124 b b 1
#5 0.002311942 a c 1
#6 0.266021979 c a 1
#7 0.419623149 c b 1
#8 0.444585270 a c 2
#9 0.536585304 b a 1
#10 0.847460017 a c 3
Update
group_by_ is deprecated in recent versions of dplyr (0.8.1 at the time of writing), so we can use group_by_at, which takes a vector of strings as input variables
df %>%
arrange(val) %>%
group_by_at(crit1) %>%
mutate(RANK=row_number())
Another option is to convert the strings to symbols (syms from rlang) and splice them in with !!!
df %>%
arrange(val) %>%
group_by(!!! rlang::syms(crit1)) %>%
mutate(RANK = row_number())
data
set.seed(24)
df <- data.frame(val = rnorm(10), name1= sample(letters[1:3], 10, replace=TRUE),
name2 = sample(letters[1:3], 10, replace=TRUE),
stringsAsFactors=FALSE)
crit1 <- c("name1", "name2")
Update with dplyr 1.0.0
The new across syntax eliminates the need for !!! rlang::syms(), so the code simplifies to:
df %>%
arrange(val) %>%
group_by(across(all_of(crit1))) %>%
mutate(RANK = row_number())
Facing a similar task I could successfully work with these two options.
Use across():
for (crit in names(df)) {
print(df |>
# all_of() marks crit as an external character vector; without it, recent tidyselect versions warn
group_by(across(all_of(crit))) |>
count())
}
Use syms() and !!:
crits <- rlang::syms(names(df))
for (crit in crits) {
print(df |>
# each crit is a single symbol, so it is unquoted with !! rather than spliced with !!!
group_by(!!crit) |>
count())
}
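When the grouping column is a single name held in a string, the .data pronoun is a further option that avoids both across() and explicit symbol conversion. A minimal sketch on the sample data from the answer above, assuming a single-string variable crit (the question's crit1 holds two names, so it would not fit here):

```r
library(dplyr)

set.seed(24)
df <- data.frame(val = rnorm(10),
                 name1 = sample(letters[1:3], 10, replace = TRUE),
                 name2 = sample(letters[1:3], 10, replace = TRUE),
                 stringsAsFactors = FALSE)

crit <- "name1"  # a single column name held in a string (assumed for this sketch)

# .data[[crit]] looks the column up by its name at run time
res <- df %>%
  arrange(val) %>%
  group_by(.data[[crit]]) %>%
  mutate(RANK = row_number()) %>%
  ungroup()
```

For several grouping columns at once, the across(all_of(...)) form is the better fit.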

Resources