Group by, pivot, count and sum in DF in R - r

I have a date frame with the fields PARTIDA (date), Operação (4 levels factor) and TT (numeric) .
I need to group by the PARTIDA column, pivot the Operation column counting to the frequency of each level and sum the TT column.
Like this:
I already tried something with dplyr but I could not get this result, can anyone help me?

Here's a two-step process that may get you what you want:
library(dplyr)
df <-
tibble(
partida = c("date1", "date2", "date3", "date1", "date2"),
operacao = c("D", "J", "C", "D", "M"),
tt = c(1, 2, 3, 4, 5)
)
tt_sums <-
df %>%
group_by(partida) %>%
count(wt = tt)
operacao_counts <-
df %>%
group_by(partida, operacao) %>%
count() %>%
ungroup() %>%
spread(operacao, n) %>%
mutate_if(is.numeric, replace_na, 0)
final_df <-
operacao_counts %>%
left_join(tt_sums, by = "partida")
> final_df
# A tibble: 3 x 6
partida C D J M n
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 date1 0 2 0 0 5
2 date2 0 0 1 1 7
3 date3 1 0 0 0 3

Similar to #cardinal40's answer but in one go as I try to limit the number of objects added to my environment when possible. Either answer will do the trick.
df %>%
group_by(partida) %>%
mutate(tt = sum(tt)) %>%
group_by(partida, operacao, tt) %>%
count() %>%
ungroup() %>%
spread(operacao, n) %>%
mutate_if(is.numeric, replace_na, 0)

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, I need to have the summarized value by all values of other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can get the data in long format for your columns of interest c(Exotic, Migrant) and take sum of Presence columns for each unique column names and it's values. This can be merged with sum of each Site.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.

creating dataframe from vectors

enter image description hereI have the following vectors:
bid = c(1,5,10,20,30,40,50)
n = c(31,29,27,25,23,21,19)
yes = c(0,3,6,7,9,13,17)
no = n - yes
I have two questions, and I don't find any solutions for them, I would appreciate if someone can help me.
Q1: I want to write R code to create a two-column dataframe df. Column 1 has Bid,
where each Bid is repeated n times; Column 2 has c(rep(1,yes),rep(0,no) at
each bid.
Q2: Then when I have the data frame df, I want to write R codes to generate
(from df) vectors bid, n, yes, and no, again.
It is a bit unclear what you actually want. It is easier if you provide the desired result. Would this fit your Q1:
library(tidyverse)
bid = c(1,5,10,20,30,40,50)
n = c(31,29,27,25,23,21,19)
yes = c(0,3,6,7,9,13,17)
no = n - yes
df <- tibble(bid, yes, n, no = n -yes) %>% dplyr::select(- n) %>% pivot_longer(cols = c(yes, no)) %>% uncount(value) %>% mutate(yesno = ifelse(name == "yes", 1,0)) %>% dplyr::select(-name)
df2 <- df %>% group_by(bid) %>% table() %>% as.data.frame() %>% pivot_wider(id_cols = bid, names_from = yesno, values_from = Freq) %>% mutate(n = yes + no) %>% rename(no = `0`, yes = `1`)
bid <- df2$bid
n <- df2$n
yes <- df2$yes
I don't know what you mean for Q2, but for Q1 you could do this:
library(tidyverse)
pmap_dfr(list(bid, n, yes, no),
\(V1, V2, V3, V4) tibble(col1 = rep(V1, V2),
col2 = c(rep(1,V3),rep(0,V4))))
#> # A tibble: 175 x 2
#> col1 col2
#> <dbl> <dbl>
#> 1 1 0
#> 2 1 0
#> 3 1 0
#> 4 1 0
#> 5 1 0
#> 6 1 0
#> 7 1 0
#> 8 1 0
#> 9 1 0
#> 10 1 0
#> # ... with 165 more rows
EDIT:
For Q2, you can follow this:
library(tidyverse)
df <- pmap_dfr(list(bid, n, yes, no),
\(V1, V2, V3, V4) tibble(col1 = rep(V1, V2),
col2 = c(rep(1,V3),rep(0,V4))))
df2 <- df |>
count(col1, col2) |>
group_by(col1) |>
summarise(yes = sum(n[col2==1]),
n = sum(n))
bid2 <- df2$col1
n2 <- df2$n
yes2 <- df2$yes
no2 <- n2 - yes2
all.equal(c(bid, n, yes, no), c(bid2, n2, yes2, no2))
#> [1] TRUE

Dplyr pipe groupby top_n does not get top_n in group

I'm trying to obtain the top 2 names, sorted alphabetically, per group. I would think that top_n() would select this after I perform a group_by. However, this does not seem to be the case. This code shows the problem.
df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
Name = c("a", "c", "b", "e", "d", "f"))
df <- df %>%
arrange(Name, Group) %>%
group_by(Group) %>%
top_n(2)
df
# A tibble: 2 x 2
# Groups: Group [1]
Group Name
<dbl> <chr>
1 1 e
2 1 f
Expected output would be:
df <- df %>%
arrange(Name, Group) %>%
group_by(Group) %>%
top_n(2)
df
Group Name
1 0 a
2 0 b
3 1 d
4 1 e
Or something similar. Thanks.
top_n selects top n max values. You seem to need top n min values. You can use index with negative values to get that. Additionaly you don't need to arrange the data when using top_n.
library(dplyr)
df %>% group_by(Group) %>% top_n(-2, Name)
# Group Name
# <dbl> <chr>
#1 0 a
#2 0 b
#3 1 e
#4 1 d
Another way is to arrange the data and select first two rows in each group.
df %>% arrange(Group, Name) %>% group_by(Group) %>% slice(1:2)
We can use
library(dplyr)
df %>%
arrange(Group, Name) %>%
group_by(Group) %>%
filter(row_number() < 3)

Filter data by group & preserve empty groups

I wonder how can I filter my data by group, and preserve the groups that are empty?
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true. Filter function simply skips empty groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr funtions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
group_by(site) %>%
filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
group_by(site) %>%
filter(value > 5) %>%
ungroup() %>%
complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
group_by(site) %>%
filter(value > 5) %>%
bind_rows(df %>%
group_by(site) %>%
filter(all(value <= 5)) %>%
summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map
df %>%
group_by(site) %>%
tidyr::nest() %>%
mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
tidyr::unnest(cols=c(data), keep_empty = TRUE)

Count non-NA values by group [duplicate]

This question already has answers here:
R group by, counting non-NA values
(3 answers)
Closed 4 years ago.
Here is my example
mydf<-data.frame('col_1' = c('A','A','B','B'), 'col_2' = c(100,NA, 90,30))
I would like to group by col_1 and count non-NA elements in col_2
I would like to do it with dplyr. Here is what I tried:
mydf %>% group_by(col_1) %>% summarise_each(funs(!is.na(col_2)))
mydf %>% group_by(col_1) %>% mutate(non_na_count = length(col_2, na.rm=TRUE))
mydf %>% group_by(col_1) %>% mutate(non_na_count = count(col_2, na.rm=TRUE))
Nothing worked. Any suggestions?
You can use this
mydf %>% group_by(col_1) %>% summarise(non_na_count = sum(!is.na(col_2)))
# A tibble: 2 x 2
col_1 non_na_count
<fctr> <int>
1 A 1
2 B 2
We can filter the NA elements in 'col_2' and then do a count of 'col_1'
mydf %>%
filter(!is.na(col_2)) %>%
count(col_1)
# A tibble: 2 x 2
# col_1 n
# <fctr> <int>
#1 A 1
#2 B 2
or using data.table
library(data.table)
setDT(mydf)[, .(non_na_count = sum(!is.na(col_2))), col_1]
Or with aggregate from base R
aggregate(cbind(col_2 = !is.na(col_2))~col_1, mydf, sum)
# col_1 col_2
#1 A 1
#2 B 2
Or using table
table(mydf$col_1[!is.na(mydf$col_2)])
library(knitr)
library(dplyr)
mydf <- data.frame("col_1" = c("A", "A", "B", "B"),
"col_2" = c(100, NA, 90, 30))
mydf %>%
group_by(col_1) %>%
select_if(function(x) any(is.na(x))) %>%
summarise_all(funs(sum(is.na(.)))) -> NA_mydf
kable(NA_mydf)

Resources