dplyr pipe group_by top_n does not get top_n in group - r

I'm trying to obtain the top 2 names, sorted alphabetically, per group. I would think that top_n() would select this after I perform a group_by. However, this does not seem to be the case. This code shows the problem.
df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
                 Name = c("a", "c", "b", "e", "d", "f"))
df <- df %>%
  arrange(Name, Group) %>%
  group_by(Group) %>%
  top_n(2)
df
# A tibble: 2 x 2
# Groups: Group [1]
Group Name
<dbl> <chr>
1 1 e
2 1 f
Expected output would be:
Group Name
1 0 a
2 0 b
3 1 d
4 1 e
Or something similar. Thanks.

top_n() selects the top n max values. You seem to need the top n min values, which you can get by passing a negative n. Additionally, you don't need to arrange the data when using top_n().
library(dplyr)
df %>% group_by(Group) %>% top_n(-2, Name)
# Group Name
# <dbl> <chr>
#1 0 a
#2 0 b
#3 1 e
#4 1 d
Another way is to arrange the data and select the first two rows in each group.
df %>% arrange(Group, Name) %>% group_by(Group) %>% slice(1:2)

We can use
library(dplyr)
df %>%
  arrange(Group, Name) %>%
  group_by(Group) %>%
  filter(row_number() < 3)
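A newer alternative, sketched here assuming dplyr >= 1.0 where slice_min() supersedes top_n(): it keeps the n smallest values of Name per group, which for a character column means the first two alphabetically.
library(dplyr)
df %>%
  group_by(Group) %>%
  slice_min(Name, n = 2) %>%  # smallest two Name values per group
  ungroup()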

Related

Filter data by group & preserve empty groups

I wonder how I can filter my data by group and preserve the groups that are empty.
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true, and filter() simply skips those empty groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr functions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
  group_by(site) %>%
  filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
  group_by(site) %>%
  filter(value > 5) %>%
  ungroup() %>%
  complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
  group_by(site) %>%
  filter(value > 5) %>%
  bind_rows(df %>%
              group_by(site) %>%
              filter(all(value <= 5)) %>%
              summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map:
df %>%
  group_by(site) %>%
  tidyr::nest() %>%
  mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
  tidyr::unnest(cols = c(data), keep_empty = TRUE)
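Another dplyr-only route, as a sketch (not taken from the answers above): filter first, then join the result back onto the full list of sites, so groups that lost all their rows come back as a single NA row.
library(dplyr)
df %>%
  filter(value > 5) %>%
  right_join(distinct(df, site), by = "site") %>%  # re-attach sites with no rows left
  arrange(site, year)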

Value based on largest value by neighbouring column

Using group_by(), I want to get the value of column value2 that corresponds to the largest value of column value:
df <- data.frame(id = c(1,1,1,1,2,2,2,2),
                 value = c(4,5,1,3,1,2,3,1),
                 value2 = c("a","b","c","d","e","f","g","h"))
df %>%
  group_by(id) %>%
  summarise(value2_of_largest_value = f(value, value2))
1 b
2 g
We can use which.max to get the index of the largest value and use that to subset value2.
library(dplyr)
f1 <- function(x, y) y[which.max(x)]
df %>%
  group_by(id) %>%
  summarise(value2 = f1(value, value2))
# or simply
# summarise(value2 = value2[which.max(value)])
# A tibble: 2 x 2
# id value2
# <dbl> <fct>
#1 1 b
#2 2 g
Another approach in dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(value == max(value))
or in data.table:
library(data.table)
setDT(df)[df[, .I[value == max(value)], by = id]$V1]
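A newer dplyr alternative, sketched assuming dplyr >= 1.0: slice_max() keeps the row with the largest value per id, and with_ties = FALSE guarantees a single row even if the maximum is tied.
library(dplyr)
df %>%
  group_by(id) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%  # one row per id: the largest value
  ungroup() %>%
  select(id, value2)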

Group by, pivot, count and sum in DF in R

I have a data frame with the fields PARTIDA (date), Operação (a factor with 4 levels) and TT (numeric).
I need to group by the PARTIDA column, pivot the Operação column counting the frequency of each level, and sum the TT column.
I already tried something with dplyr but I could not get this result. Can anyone help me?
Here's a two-step process that may get you what you want:
library(dplyr)
library(tidyr)
df <-
  tibble(
    partida = c("date1", "date2", "date3", "date1", "date2"),
    operacao = c("D", "J", "C", "D", "M"),
    tt = c(1, 2, 3, 4, 5)
  )
tt_sums <-
  df %>%
  group_by(partida) %>%
  count(wt = tt)
operacao_counts <-
  df %>%
  group_by(partida, operacao) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)
final_df <-
  operacao_counts %>%
  left_join(tt_sums, by = "partida")
> final_df
# A tibble: 3 x 6
partida C D J M n
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 date1 0 2 0 0 5
2 date2 0 0 1 1 7
3 date3 1 0 0 0 3
Similar to #cardinal40's answer, but in one go, as I try to limit the number of objects added to my environment when possible. Either answer will do the trick.
df %>%
  group_by(partida) %>%
  mutate(tt = sum(tt)) %>%
  group_by(partida, operacao, tt) %>%
  count() %>%
  ungroup() %>%
  spread(operacao, n) %>%
  mutate_if(is.numeric, replace_na, 0)
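In current tidyr, pivot_wider() supersedes spread(). Here is a one-pipeline sketch assuming dplyr >= 1.0 and tidyr >= 1.1; the column names "freq" and "tt_total" are just illustrative.
library(dplyr)
library(tidyr)
df %>%
  count(partida, operacao, name = "freq") %>%                                 # frequency of each operacao per partida
  pivot_wider(names_from = operacao, values_from = freq, values_fill = 0) %>%
  left_join(count(df, partida, wt = tt, name = "tt_total"), by = "partida")   # add the summed tt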

R Sum numbers within string

I have a question:
I have a dataset like this simple example:
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/+2","19/18/14"))
I want to end up with a dataset that has the score numbers split. I have a problem with the "/+2" part: when it says "3/+2" it actually means "3/3+2", which would finally give "3/5". So what I would like some help with is to end up with a dataset like this:
ID Score
A 15
B 16/18/19/21/6
C 3/5
D 19/18/14
I already found out that I can separate the score with
df <- df %>%
  mutate(Score = strsplit(as.character(Score), "/")) %>%
  unnest(Score)
But I don't know how I can duplicate the numbers and then sum them when "/+" occurs. Could someone help me?
It could probably be solved in a more elegant way, but here is one possibility:
df %>%
  mutate(Score = strsplit(as.character(Score), "/")) %>%
  unnest() %>%
  rowwise() %>%
  mutate(Score = eval(parse(text = paste0(Score)))) %>%
  group_by(ID) %>%
  mutate(Score = paste0(Score, collapse = "/")) %>%
  distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/21/6
3 C 3/5
4 D 19/18/14
Sample data:
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/3+2","19/18/14"))
It splits "Score" on "/", converts each piece to an expression with parse(), evaluates it, and then pastes the results back together.
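For example, the parse()/eval() step turns each text chunk into a number, doing the addition where there is an operator:
eval(parse(text = "3+2"))  # evaluates the string as R code: 5
eval(parse(text = "15"))   # a plain number comes back unchanged: 15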
Using the data you provided and the pattern from #A. Suliman:
df %>%
  mutate(Score = strsplit(gsub("(\\d+)/*\\+(\\d+)", "\\1/\\1+\\2", Score), "/")) %>%
  unnest() %>%
  rowwise() %>%
  mutate(Score = eval(parse(text = paste0(Score)))) %>%
  group_by(ID) %>%
  mutate(Score = paste0(Score, collapse = "/")) %>%
  distinct()
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14
We can use gsubfn to do this in a compact way
library(gsubfn)
library(tidyverse)
df %>%
  mutate(Score = gsubfn("\\d+\\+\\d+", ~ eval(parse(text = x)), Score))
# ID Score
#1 A 15
#2 B 16/18/21/6
#3 C 3/5
#4 D 19/18/14
data
df <- data.frame(ID = c("A","B","C","D"),
                 Score = c("15","16/18/19+2/6","3/3+2","19/18/14"),
                 stringsAsFactors = FALSE)
library(dplyr)
library(tidyr) # separate_rows, no need for unnest
df %>%
  rowwise() %>%
  mutate(Score_upd = paste0(sapply(unlist(strsplit(gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', Score), '/')),
                                   function(x) eval(parse(text = x))),
                            collapse = '/')) %>%
  separate_rows(Score_upd, sep = '/')
# short version
df %>%
  mutate(Score = gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', Score)) %>%
  separate_rows(Score, sep = '/') %>%
  rowwise() %>% mutate(Score = eval(parse(text = Score))) %>%
  group_by(ID) %>% summarise(Score = paste0(Score, collapse = '/'))
# A tibble: 4 x 2
ID Score
<fct> <chr>
1 A 15
2 B 16/18/19/21/6
3 C 3/5
4 D 19/18/14
The main idea is using gsub to separate 2+3 correctly, e.g.:
gsub('(\\d+)/*\\+(\\d+)', '\\1/\\1+\\2', '20/8/2+3') # /* matches zero or more "/", so the pattern covers both 19+2 and 3/+2
[1] "20/8/2/2+3"
Then
valid_str <- gsub('(\\d+)/*\\+(\\d+)','\\1/\\1+\\2','20/8/2+3')
sapply(unlist(strsplit(valid_str,'/')),function(x) eval(parse(text=x)))
20 8 2 2+3
20 8 2 5
#OR
sapply(unlist(strsplit(valid_str,'/')),function(x) sum(as.numeric(unlist(strsplit(x,'\\+')))))
20 8 2 2+3
20 8 2 5
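A sketch that avoids eval(parse()) altogether: after the same gsub() fix-up, split each piece on "+" and sum the numbers. The helper fix_score() is hypothetical, not part of the answers above, and assumes Score is a character column.
library(dplyr)
fix_score <- function(s) {
  s <- gsub("(\\d+)/*\\+(\\d+)", "\\1/\\1+\\2", s)           # "3/+2" -> "3/3+2"
  parts <- strsplit(s, "/")[[1]]
  sums <- vapply(strsplit(parts, "\\+"),
                 function(p) sum(as.numeric(p)), numeric(1))  # "19+2" -> 21
  paste(sums, collapse = "/")
}
df %>% mutate(Score = vapply(as.character(Score), fix_score, character(1), USE.NAMES = FALSE))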

dplyr conditional count function in R language

In the following code, the 2nd payment for item b has a zero value. Using the pipe %>%, is it possible to show the count for item b as 2, not 3, since there is no payment for b at the 2nd payment?
df <- data.frame("item" = c("a","b","b","a","b"), "payment" = c(10,20,0,40,30))
df_sum <-
  df %>%
  group_by(item) %>%
  summarise(total = sum(payment),
            totalcount = n())
You can filter out rows you don't want. E.g., if you don't want to count rows where payment = 0, you can use filter:
df %>%
  group_by(item) %>%
  filter(payment > 0) %>%
  summarise(total = sum(payment),
            totalcount = n())
# A tibble: 2 x 3
item total totalcount
<fct> <dbl> <int>
1 a 50.0 2
2 b 50.0 2
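An alternative sketch that keeps the zero-payment rows in the data and simply counts how many payments are non-zero per item:
df %>%
  group_by(item) %>%
  summarise(total = sum(payment),            # zero rows don't change the sum
            totalcount = sum(payment > 0))   # count only non-zero payments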
