Include empty factor levels in tally with tidyr and dplyr - r

A question as I learn dplyr and its ilk.
I am calculating a tally and a relative frequency of a factor conditioned on two other variables in a df. For instance:
library(dplyr)
library(tidyr)
set.seed(3457)
pct <- function(x) {x/sum(x)}
foo <- data.frame(x = rep(seq(1:3),20),
y = rep(rep(c("a","b"),each=3),10),
z = LETTERS[floor(runif(60, 1,5))])
bar <- foo %>%
group_by(x, y, z) %>%
tally %>%
mutate(freq = (n / sum(n)) * 100)
head(bar)
I'd like the output, bar, to include all the levels of foo$z. I.e., there are no cases of C here:
subset(bar, x==2 & y=="a")
How can I have bar tally the missing levels so I get:
subset(bar, x==2 & y=="a",select = n)
to return 4, 5, 0, 1 (and select = freq to give 40, 50, 0, 10)?
Many thanks.
Edit: Ran with the seed set!

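We can use complete from tidyr. Ungrouping first matters because bar is still grouped by x and y, and recent tidyr releases do not allow completing on a grouping column:
bar1 <- bar %>%
ungroup() %>%
complete(z, nesting(x, y), fill = list(n = 0, freq = 0)) %>%
# restore the original column order (select_() is deprecated)
select(all_of(names(bar)))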
filter(bar1, x==2 & y=="a")
# x y z n freq
# <int> <fctr> <fctr> <dbl> <dbl>
#1 2 a A 4 40
#2 2 a B 5 50
#3 2 a C 0 0
#4 2 a D 1 10
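As an alternative sketch, assuming dplyr 0.8.0 or later and that z is stored as a factor with every level declared: count() accepts .drop = FALSE, which keeps factor levels that never occur within a group. The toy data frame below is made up for illustration and is not the foo from the question.

```r
library(dplyr)

# Hypothetical toy data: z is a factor with three declared levels,
# but not every level occurs for every (x, y) pair.
toy <- data.frame(
  x = c(1L, 1L, 1L, 2L),
  y = c("a", "a", "a", "b"),
  z = factor(c("A", "A", "B", "C"), levels = c("A", "B", "C"))
)

res <- toy %>%
  # .drop = FALSE keeps z levels that are absent for an (x, y) pair, with n = 0
  count(x, y, z, .drop = FALSE) %>%
  group_by(x, y) %>%
  mutate(freq = n / sum(n) * 100) %>%
  ungroup()

print(res)
```

Because freq is computed after the zero-count rows exist, no separate fill value is needed for it.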

Related

Summarize one variable/column over all possible values of other variables/columns

I need to summarize one variable/column of a long table after aggregating (group_by()) by another variable/column, and I need the summarized value for every value of the other variables/columns.
Here is test data:
library(tidyverse)
set.seed(123)
Site <- str_c("S", 1:5)
Species <- str_c("Sps", 1:6)
print(Species_tbl <- bind_cols(Species = Species,
Exotic = rbinom(length(Species), 1, .3),
Migrant = rbinom(length(Species), 2, .3)))
Data_tbl <- expand.grid(Site = Site,
Species = Species) %>%
left_join(Species_tbl)
Data_tbl$Presence <- rbinom(nrow(Data_tbl), 1, .5)
And here is my best effort:
print(Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence),
N_sp_Exo = sum(Presence[Exotic == 1]),
N_sp_Nat = sum(Presence[Exotic == 0]),
N_sp_M0 = sum(Presence[Migrant == 0]),
N_sp_M1 = sum(Presence[Migrant == 1]),
N_sp_M2 = sum(Presence[Migrant == 2])))
You can pivot the columns of interest, c(Exotic, Migrant), into long format and take the sum of Presence for each combination of column name and value. The result can then be joined with the per-Site totals.
library(dplyr)
library(tidyr)
data1 <- Data_tbl %>%
group_by(Site) %>%
summarise(N_sp = sum(Presence))
data2 <- Data_tbl %>%
pivot_longer(cols = c(Exotic, Migrant)) %>%
group_by(Site, name, value) %>%
summarise(result = sum(Presence), .groups = "drop") %>%
pivot_wider(names_from = c(name, value), values_from = result)
inner_join(data1, data2, by = 'Site')
# Site N_sp Exotic_0 Exotic_1 Migrant_0 Migrant_1 Migrant_2
# <fct> <int> <int> <int> <int> <int> <int>
#1 S1 4 2 2 1 2 1
#2 S2 3 2 1 0 2 1
#3 S3 2 1 1 0 2 0
#4 S4 4 2 2 1 3 0
#5 S5 4 1 3 1 2 1
The answer has been divided in two steps for ease of readability. If you would like to do this in a single chain without creating temporary variables that can be done as well.
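A minimal sketch of what such a single chain could look like, under the assumption that the per-site total is carried along as an extra id column instead of being joined back in. The toy_tbl below is a hypothetical stand-in for Data_tbl, kept small so the result is easy to check by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column roles as Data_tbl
toy_tbl <- tibble(
  Site     = c("S1", "S1", "S1", "S2", "S2", "S2"),
  Exotic   = c(0, 1, 1, 0, 1, 1),
  Migrant  = c(0, 1, 2, 0, 1, 2),
  Presence = c(1, 1, 0, 0, 1, 1)
)

single_chain <- toy_tbl %>%
  group_by(Site) %>%
  mutate(N_sp = sum(Presence)) %>%   # site total, repeated on every row
  ungroup() %>%
  pivot_longer(cols = c(Exotic, Migrant)) %>%
  group_by(Site, N_sp, name, value) %>%
  summarise(result = sum(Presence), .groups = "drop") %>%
  # values_fill = 0 covers name/value combinations missing at a site
  pivot_wider(names_from = c(name, value), values_from = result,
              values_fill = 0)

print(single_chain)
```

Keeping N_sp in the grouping means pivot_wider treats it as an identifier, so it ends up next to Site without a join.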

Adding a Proportion Column with Dplyr

Let's say I have the following data frame, from which I then counted the occurrences of a, b, and c according to whether Z is 0 or 1:
X <- (1:10)
Y<- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
(The code below was provided by a Stack Exchange member, thank you!)
res <- test_df %>% group_by(Y,Z) %>% summarise(N=n()) %>%
pivot_wider(names_from = Z,values_from=N,
values_fill = 0)
How might I add a column on the right that gives, for each letter, the proportion of its appearances for which Z = 1? It would seem that a basic summary statement should work, but I can't figure out how...
My expected output would be something like
Z=0 Z=1 PropZ=1
a 2 2 .5
b 1 2 .66
c 0 3 1
Perhaps this helps
library(dplyr)
library(tidyr)
test_df %>%
group_by(Y, Z) %>%
summarise(N = n(), .groups = 'drop') %>%
left_join(test_df %>%
group_by(Y) %>%
summarise(Prop = mean(Z == 1), .groups = 'drop'), by = "Y") %>%
pivot_wider(names_from = Z, values_from = N, values_fill = 0)
-output
# A tibble: 3 x 4
# Y Prop `0` `1`
# <chr> <dbl> <int> <int>
#1 a 0.5 2 2
#2 b 0.667 1 2
#3 c 1 0 3
test_df %>% group_by(Y) %>%
summarise( z0 = sum(Z == 0), z1 = sum(Z == 1) , PropZ = z1/n())
I am not sure what your expected output is, but below are some options:
u <- xtabs(q ~ Y + Z, cbind(test_df, q = 1))
> u
Z
Y 0 1
a 2 2
b 1 2
c 0 3
or
> prop.table(u)
Z
Y 0 1
a 0.2 0.2
b 0.1 0.2
c 0.0 0.3
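Note that prop.table(u) divides every cell by the grand total. If the goal is the share of Z = 1 within each letter, a sketch using the margin argument (row-wise proportions) might look like the following; it rebuilds test_df from the question so that it runs on its own.

```r
# Rebuild the question's data so the example is self-contained
test_df <- data.frame(
  X = 1:10,
  Y = c("a", "b", "a", "c", "b", "b", "a", "a", "c", "c"),
  Z = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Contingency table of counts: rows are letters, columns are Z values
u <- xtabs(~ Y + Z, data = test_df)

# margin = 1 normalises each row, so every letter's proportions sum to 1
row_props <- prop.table(u, margin = 1)

# Proportion of Z = 1 for each letter
prop_z1 <- row_props[, "1"]
print(prop_z1)
```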
To calculate proportions of 1 for each letter you can use rowSums.
transform(res, prop_1 = `1`/rowSums(res[-1]))
In dplyr :
library(dplyr)
res %>%
ungroup %>%
mutate(prop_1 = `1`/rowSums(.[-1]))
# Y `0` `1` prop_1
# <chr> <int> <int> <dbl>
#1 a 2 2 0.5
#2 b 1 2 0.667
#3 c 0 3 1

How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

I'm trying to find the top 3 factor levels within each group, based on an aggregating variable, and group the remaining factor levels into "other" for each group. Normally I'd use fct_lump_n for this, but I can't figure out how to make it work within each group.
Here's an example, where I want to form groups based on the x variable, order the y variables based on the value of z, choose the first 3 y variables, and group the rest of y into "other":
set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = T)),
y = factor(sample(letters[1:10], 100, replace = T)),
z = sample(100, 100, replace = T))
I've tried doing this:
df %>%
group_by(x) %>%
arrange(desc(z), .by_group = T) %>%
slice_head(n = 3)
which returns this:
# A tibble: 9 x 3
# Groups: x [3]
x y z
<fct> <fct> <int>
1 r i 95
2 r c 92
3 r a 88
4 s g 94
5 s g 92
6 s f 92
7 t j 100
8 t d 93
9 t i 81
This is basically what I want, but I'm missing the 'other' variable within each of r, s, and t, which collects the values of z which have not been counted.
Can I use fct_lump_n for this? Or slice_head combined with grouping the excluded variables into "other"?
Tried in R 4.0.0 and tidyverse 1.3.0:
set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = T)),
y = factor(sample(letters[1:10], 100, replace = T)),
z = sample(100, 100, replace = T))
df %>%
group_by(x) %>%
arrange(desc(z)) %>%
mutate(a = row_number(-z)) %>%
mutate(y = case_when(a > 3 ~ "Other", TRUE ~ as.character(y))) %>%
mutate(a = case_when(a > 3 ~ "Other", TRUE ~ as.character(a))) %>%
group_by(x, y, a) %>%
summarize(z = sum(z)) %>%
arrange(x, a) %>%
select(-a)
Output:
# A tibble: 12 x 3
# Groups: x, y [11]
x y z
<fct> <chr> <int>
1 r b 92
2 r j 89
3 r g 83
4 r Other 749
5 s i 93
6 s h 93
7 s i 84
8 s Other 1583
9 t a 99
10 t b 98
11 t i 95
12 t Other 1508
Note: the variable a is used together with y to compensate for the fact that y is sampled with replacement (see rows 5 and 7 of the output). Without a, rows 5 and 7 would have their z values summed together. Also note that I left y as character, since I suppose the "Other" entries in different groups are not meant to be one and the same factor level.
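If the goal is instead the top levels of y ranked by their summed z within each x (rather than the top individual rows), one possible sketch is to aggregate first, rank within the group, and then relabel everything outside the top ranks. The toy data and the name top_n_levels are made up for illustration; this is a plain dplyr approach, not fct_lump_n.

```r
library(dplyr)

# Hypothetical toy data; level "a" appears twice in group "r"
toy <- tibble(
  x = c("r", "r", "r", "r", "r", "s", "s", "s", "s"),
  y = c("a", "b", "c", "d", "a", "a", "b", "c", "d"),
  z = c(10, 5, 3, 2, 1, 1, 2, 3, 4)
)

top_n_levels <- 2  # how many levels to keep per group (illustrative choice)

lumped <- toy %>%
  group_by(x, y) %>%
  summarise(z = sum(z), .groups = "drop_last") %>%  # still grouped by x
  # rank levels within each x by total z; everything else becomes "Other"
  mutate(y = if_else(min_rank(desc(z)) <= top_n_levels, y, "Other")) %>%
  group_by(x, y) %>%
  summarise(z = sum(z), .groups = "drop")

print(lumped)
```

Ties are resolved by min_rank here, so tied levels can push the number of kept levels above top_n_levels; a different ranking function would change that behaviour.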

Tidyverse Solution for Using Tibble Columns as Input to a Function

I am trying to run a function on all combinations of two column vectors in a tibble.
library(tidyverse)
combination <- tibble(x = c(1, 2), y = c(3, 4))
sum_square <- function(x, y) {
x^2+y^2
}
I would like to run this function on all combinations of column x and column y:
sum_square(1, 3)
sum_square(1, 4)
sum_square(2, 3)
sum_square(2, 4)
Ideally I would like a tidyverse solution.
We can first expand and then apply sum_square on the expanded dataset
library(tidyverse)
expand(combination, x, y) %>%
mutate(new = sum_square(x, y))
# A tibble: 4 x 3
# x y new
# <dbl> <dbl> <dbl>
#1 1 3 10
#2 1 4 17
#3 2 3 13
#4 2 4 20
Another option is outer
combination %>%
reduce(outer, FUN = sum_square) %>%
c %>%
tibble(new = .)
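One thing to keep in mind when choosing between the two approaches is that they return the values in a different order. The sketch below uses tidyr's expand_grid in place of expand (an assumption made so that the example needs no pre-existing tibble) and compares it with flattening the matrix returned by outer.

```r
library(dplyr)
library(tidyr)

sum_square <- function(x, y) {
  x^2 + y^2
}

x <- c(1, 2)
y <- c(3, 4)

# expand_grid varies the first column slowest: (1,3), (1,4), (2,3), (2,4)
grid <- expand_grid(x = x, y = y) %>%
  mutate(new = sum_square(x, y))

# outer returns a matrix with x along the rows and y along the columns;
# as.vector flattens it column by column: (1,3), (2,3), (1,4), (2,4)
m <- outer(x, y, FUN = sum_square)
flat <- as.vector(m)

print(grid$new)
print(flat)
```

Both contain the same four values, but the grid version keeps x and y alongside each result, which makes the pairing explicit.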

Remove duplicated rows using dplyr

I have a data.frame like this -
set.seed(123)
df = data.frame(x=sample(0:1,10,replace=T),y=sample(0:1,10,replace=T),z=1:10)
> df
x y z
1 0 1 1
2 1 0 2
3 0 1 3
4 1 1 4
5 1 0 5
6 0 1 6
7 1 0 7
8 1 0 8
9 1 0 9
10 0 1 10
I would like to remove duplicate rows based on first two columns. Expected output -
df[!duplicated(df[,1:2]),]
x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using dplyr package.
Here is a solution using dplyr >= 0.5.
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
> df %>% distinct(x, y, .keep_all = TRUE)
x y z
1 0 1 1
2 1 0 2
3 1 1 4
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)
I've also been thinking about adding a slice() function that would
work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which
variables to use:
df %>% unique(x, y)
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter (! duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
x = sample(0:1, 10, replace = T),
y = sample(0:1, 10, replace = T),
z = 1:10
)
# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
group_by(x, y) %>%
slice(1)
Difference from using the distinct() function
The advantage of this solution is that it makes it explicit which rows are retained from the original dataframe, and it can pair nicely with the arrange() function.
Let's say you had customer sales data and you wanted to retain one record per customer, and you want that record to be the one from their latest purchase. Then you could write:
customer_purchase_data %>%
arrange(desc(Purchase_Date)) %>%
group_by(Customer_ID) %>%
slice(1)
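To make the pattern above concrete, here is a small sketch with made-up data. The column names follow the snippet above, while the values and the Amount column are hypothetical. It also shows that arranging first and then calling distinct with .keep_all = TRUE retains the same records.

```r
library(dplyr)

# Hypothetical purchase records
customer_purchase_data <- tibble(
  Customer_ID   = c(1, 1, 2, 2, 2),
  Purchase_Date = as.Date(c("2020-01-05", "2020-03-01", "2019-12-24",
                            "2020-02-14", "2020-01-30")),
  Amount        = c(20, 35, 10, 50, 15)
)

# Latest purchase per customer: sort newest first, then keep row 1 per group
latest <- customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1) %>%
  ungroup()

# Same records via distinct(): the first row seen per Customer_ID is kept
latest_distinct <- customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  distinct(Customer_ID, .keep_all = TRUE)

print(latest)
print(latest_distinct)
```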
When selecting columns in R for a reduced data-set you can often end up with duplicates.
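These two lines return the same unique combinations of the two selected columns, although the row order and the class of the result can differ (distinct keeps first-appearance order, while summarise sorts by the grouping columns):
distinct(mtcars, cyl, hp)
summarise(group_by(mtcars, cyl, hp))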
If you want to find the rows that are duplicated you can use find_duplicates from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
b = c(5, 2, 2, 8))
df %>% find_duplicates()
