Creating a dataframe of stats from another dataframe in R

I have the following code.
n_manu <- mpg %>% dplyr::select(manufacturer) %>% n_distinct()
n_model <- mpg %>% dplyr::select(model) %>% n_distinct()
n_year <- mpg %>% dplyr::select(year) %>% n_distinct()
I want to put the results in a dataframe that looks like this:
stat value
n_manu 15
n_model 38
n_year 2
Is there a way to do this elegantly, without three separate lines of code for calculating the distinct values?

library(tidyverse)
mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct))
Gives
# A tibble: 1 × 3
manufacturer model year
<int> <int> <int>
1 15 38 2
and
mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct)) %>%
pivot_longer(everything(), names_to="stat")
# A tibble: 3 × 2
stat value
<chr> <int>
1 manufacturer 15
2 model 38
3 year 2
From there you can finesse "row labels" with ease.
To save the results as a dataframe, simply assign the result of the pipe to an object:
summaryStats <- mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct)) %>%
pivot_longer(everything(), names_to="stat")
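If the row labels should match the names in the question exactly, one possible variation is to name each summary directly inside summarise. This is only a sketch, not part of the answer above: the object name summary_stats is made up for illustration, and mpg is assumed to come from ggplot2.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

# Name each statistic directly, then reshape to one row per statistic
summary_stats <- mpg %>%
  summarise(
    n_manu  = n_distinct(manufacturer),
    n_model = n_distinct(model),
    n_year  = n_distinct(year)
  ) %>%
  pivot_longer(everything(), names_to = "stat", values_to = "value")

summary_stats
```

This trades the compactness of across() for full control over the labels, which is reasonable when there are only a few statistics.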

Related

How do I create a function to mutate new columns with a variable name and "_pct"?

Using mtcars as an example. I would like to write a function that creates a count and pct column such as below -
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(cyl_pct = count/sum(count))
This produces the output -
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
However, I would like to create a function where I can specify any column as the group_by column, and the mutated column will be named after that group_by column with a _pct suffix. So if I want to use disp, disp will be my group_by variable and the function will create a disp_pct column.
Similar to akrun's answer, but using {{ instead of !!:
foo = function(data, col) {
data %>%
group_by({{col}}) %>%
summarize(count = n()) %>%
ungroup %>%
mutate(
"{{col}}_pct" := count / sum(count)
)
}
foo(mtcars, cyl)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
# 1 4 11 0.344
# 2 6 7 0.219
# 3 8 14 0.438
Assuming that the input is unquoted, convert it to a symbol with ensym and evaluate it (!!) within group_by. Separately, convert the symbol into a string (as_string) and append the suffix '_pct' to build the new column name. In mutate we can use := along with !! to assign the column name from the object created ('colnm').
library(stringr)
library(dplyr)
f1 <- function(dat, grp) {
grp <- ensym(grp)
colnm <- str_c(rlang::as_string(grp), '_pct')
dat %>%
group_by(!!grp) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(!! colnm := count/sum(count))
}
Testing:
f1(mtcars, cyl)
# A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
#1 4 11 0.344
#2 6 7 0.219
#3 8 14 0.438
This is probably no different from the one posted by my dear friend @akrun. However, in my version I used the enquo function instead of ensym.
There is actually a subtle difference between the two and I thought you might be interested to know:
As per the documentation of nse-defuse, ensym returns a raw expression whereas enquo returns a "quosure", which is in fact a "wrapper containing an expression and an environment". So we need one extra step to access the expression of the quosure made by enquo.
In this case we use get_expr for our purpose. So here is just another version of this function that I thought might be of interest to whoever reads this post in the future.
library(dplyr)
library(rlang)
fn <- function(data, Var) {
Var <- enquo(Var)
colnm <- paste(get_expr(Var), "pct", sep = "_")
data %>%
group_by(!!Var) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(!! colnm := count/sum(count))
}
fn(mtcars, cyl)
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
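The answers above all take an unquoted column name. When the column name arrives as a character string instead (for example from a loop over column names), a variation along the following lines might work. It is a sketch under that assumption: count_pct_by and col_name are hypothetical names, and it relies on across(all_of()) for grouping plus glue-style naming on the left of :=.

```r
library(dplyr)

# Hypothetical helper: col_name is a character string such as "cyl"
count_pct_by <- function(data, col_name) {
  data %>%
    group_by(across(all_of(col_name))) %>%
    summarise(count = n(), .groups = "drop") %>%
    # Glue syntax on the left of := builds the new column name
    mutate("{col_name}_pct" := count / sum(count))
}

res <- count_pct_by(mtcars, "disp")
names(res)
```

Note the single braces in "{col_name}_pct": they interpolate an ordinary variable, whereas the double braces in "{{col}}_pct" are for a function argument passed unquoted.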

Calculate proportions according to different groups [duplicate]

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?
library(dplyr)
data(mtcars)
mtcars <- as_tibble(mtcars) # tbl_df() is deprecated; as_tibble() is the current equivalent
# count frequency
mtcars %>%
group_by(am, gear) %>%
summarise(n = n())
# am gear n
# 0 3 15
# 0 4 4
# 1 4 8
# 1 5 5
What I would like to achieve:
am gear n rel.freq
0 3 15 0.7894737
0 4 4 0.2105263
1 4 8 0.6153846
1 5 5 0.3846154
Try this:
mtcars %>%
group_by(am, gear) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
From the dplyr vignette:
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.
Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups.
The outcome of the peeling of course depends on the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am) to make your code more explicit.
For rounding and prettification, please refer to the nice answer by @Tyler Rinker.
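The peeling behaviour described in this answer can be checked directly. A minimal sketch, using group_vars() to list the grouping columns before and after summarise (the object names by_am_gear and counted are made up for illustration):

```r
library(dplyr)

by_am_gear <- mtcars %>% group_by(am, gear)
group_vars(by_am_gear)   # grouped by both am and gear

# "drop_last" is the default for this kind of summary; stated here to be explicit
counted <- by_am_gear %>% summarise(n = n(), .groups = "drop_last")
group_vars(counted)      # only am remains, so sum(n) in a later mutate runs within am
```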
You can use the count() function, although its behaviour differs depending on the dplyr version:
dplyr 0.7.1: returns an ungrouped table, so you need to group again by am
dplyr < 0.7.1: returns a grouped table, so there is no need to group again, although you might want to ungroup() for later manipulations
dplyr 0.7.1
mtcars %>%
count(am, gear) %>%
group_by(am) %>%
mutate(freq = n / sum(n))
dplyr < 0.7.1
mtcars %>%
count(am, gear) %>%
mutate(freq = n / sum(n))
This results in a grouped table. If you want to use it for further analysis, it might be useful to remove the grouping with ungroup().
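On recent dplyr versions the regrouping and ungrouping steps can be avoided altogether. The sketch below assumes dplyr 1.1.0 or later, where mutate accepts a .by argument for per-operation grouping; it is not part of the original answer.

```r
library(dplyr)

res <- mtcars %>%
  count(am, gear) %>%
  # .by groups only for this mutate call, so the result stays ungrouped
  mutate(freq = n / sum(n), .by = am)

res
```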
@Henrik's answer is better for usability, as this one turns the column into character rather than numeric, but it matches what you asked for...
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
EDIT Because Spacedman asked for it :-)
as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
class(x) <- c("rel_freq", class(x))
attributes(x)[["rel_freq_col"]] <- rel_freq_col
x
}
print.rel_freq <- function(x, ...) {
freq_col <- attributes(x)[["rel_freq_col"]]
x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")
class(x) <- class(x)[!class(x)%in% "rel_freq"]
print(x)
}
mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
as.rel_freq()
## Source: local data frame [4 x 4]
## Groups: am
##
## am gear n rel.freq
## 1 0 3 15 79%
## 2 0 4 4 21%
## 3 1 4 8 62%
## 4 1 5 5 38%
Despite the many answers, here is one more approach, which uses prop.table in combination with dplyr or data.table.
library(dplyr)
mtcars %>%
group_by(am, gear) %>%
tally() %>%
mutate(freq = prop.table(n))
#> # A tibble: 4 × 4
#> # Groups: am [2]
#> am gear n freq
#> <dbl> <dbl> <int> <dbl>
#> 1 0 3 15 0.789
#> 2 0 4 4 0.211
#> 3 1 4 8 0.615
#> 4 1 5 5 0.385
library(data.table)
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n), by = "am"][]
#> am gear n freq
#> 1: 0 3 15 0.7894737
#> 2: 0 4 4 0.2105263
#> 3: 1 4 8 0.6153846
#> 4: 1 5 5 0.3846154
Created on 2022-10-22 with reprex v2.0.2
I wrote a small function for this repeating task:
count_pct <- function(df) {
return(
df %>%
tally %>%
mutate(n_pct = 100*n/sum(n))
)
}
I can then use it like:
mtcars %>%
group_by(cyl) %>%
count_pct
It returns:
# A tibble: 3 x 3
cyl n n_pct
<dbl> <int> <dbl>
1 4 11 34.4
2 6 7 21.9
3 8 14 43.8
For the sake of completeness on this popular question: since version 1.0.0 of dplyr, the .groups parameter controls the grouping structure of the output of summarise after group_by (see the summarise help page).
With .groups = "drop_last", summarise drops the last level of grouping. This was the only behaviour available before version 1.0.0.
library(dplyr)
library(scales)
original <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
original
#> # A tibble: 4 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 1 4 8 61.5%
#> 4 1 5 5 38.5%
new_drop_last <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop_last") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(original, new_drop_last)
#> [1] TRUE
With .groups = "drop", all levels of grouping are dropped. The result is an independent tibble with no trace of the previous group_by.
# .groups = "drop"
new_drop <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "drop") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_drop
#> # A tibble: 4 x 4
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 46.9%
#> 2 0 4 4 12.5%
#> 3 1 4 8 25.0%
#> 4 1 5 5 15.6%
With .groups = "keep", the result has the same grouping structure as .data (here, mtcars grouped by am and gear). summarise does not peel off any variable used in the group_by.
Finally, with .groups = "rowwise", each row is its own group. It is equivalent to "keep" in this situation.
# .groups = "keep"
new_keep <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "keep") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
new_keep
#> # A tibble: 4 x 4
#> # Groups: am, gear [4]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 100.0%
#> 2 0 4 4 100.0%
#> 3 1 4 8 100.0%
#> 4 1 5 5 100.0%
# .groups = "rowwise"
new_rowwise <- mtcars %>%
group_by (am, gear) %>%
summarise (n=n(), .groups = "rowwise") %>%
mutate(rel.freq = scales::percent(n/sum(n), accuracy = 0.1))
dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE
Another point that can be of interest is that sometimes, after applying group_by and summarise, a summary line can help.
# create a subtotal line to help readability
subtotal_am <- mtcars %>%
group_by (am) %>%
summarise (n=n()) %>%
mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)
mtcars %>% group_by (am, gear) %>%
summarise (n=n()) %>%
mutate(rel.freq = n/sum(n)) %>%
bind_rows(subtotal_am) %>%
arrange(am, gear) %>%
mutate(rel.freq = scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups: am [2]
#> am gear n rel.freq
#> <dbl> <dbl> <int> <chr>
#> 1 0 3 15 78.9%
#> 2 0 4 4 21.1%
#> 3 0 NA 19 100.0%
#> 4 1 4 8 61.5%
#> 5 1 5 5 38.5%
#> 6 1 NA 13 100.0%
Created on 2020-11-09 by the reprex package (v0.3.0)
Hope you find this answer useful.
Here is a general function implementing Henrik's solution on dplyr 0.7.1.
freq_table <- function(x,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
x %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n /sum(n)) %>%
ungroup
}
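For comparison, the same idea can be written with the {{ }} operator instead of enquo and !!, together with a usage example. This is a sketch, not the answer's own code: freq_table2 is a hypothetical name, and it assumes a dplyr version in which count() returns an ungrouped table.

```r
library(dplyr)

# Hypothetical variant of freq_table using {{ }} (embrace) for tidy evaluation
freq_table2 <- function(x, group_var, prop_var) {
  x %>%
    count({{ group_var }}, {{ prop_var }}) %>%
    group_by({{ group_var }}) %>%
    mutate(freq = n / sum(n)) %>%   # proportion within each group_var level
    ungroup()
}

res <- freq_table2(mtcars, am, gear)
res
```

Within each level of am the freq column sums to 1, matching the rel.freq layout requested in the question.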
Also, try add_count() (to get around pesky group_by .groups).
mtcars %>%
count(am, gear) %>%
add_count(am, wt = n, name = "nn") %>%
mutate(proportion = n / nn)
Here is a base R answer using aggregate and ave :
df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
df1$prop <- with(df1, n/ave(n, am, FUN = sum))
#Also with prop.table
#df1$prop <- with(df1, ave(n, am, FUN = prop.table))
df1
# am gear n prop
#1 0 3 15 0.7894737
#2 0 4 4 0.2105263
#3 1 4 8 0.6153846
#4 1 5 5 0.3846154
We can also use prop.table but the output displays differently.
prop.table(table(mtcars$am, mtcars$gear), 1)
# 3 4 5
# 0 0.7894737 0.2105263 0.0000000
# 1 0.0000000 0.6153846 0.3846154
This answer is based upon Matifou's answer.
First I modified it to ensure that the freq column is not returned in scientific notation, by using the scipen option.
Then I multiply the result by 100 to get a percentage rather than a decimal, which makes the freq column easier to read.
getOption("scipen")
options("scipen"=10)
mtcars %>%
count(am, gear) %>%
mutate(freq = (n / sum(n)) * 100)

How to calculate the sum of rows by each gvkey respectively?

I tried to calculate the cumulative sum of the Twitter followers for each gvkey separately, using the group_by function, but the output is still the cumulative sum over the entire column. I suspect the problem is in the "for (i in 1:nrow(predmod_e))" loop.
predmod_e <- predmod_e %>%
arrange(gvkey, date) %>% # arrange by gvkey and date
group_by(gvkey) # group by gvkey for per-group calculation
for (i in 1:nrow(predmod_e)) {
predmod_e[i+1,]$x <- predmod_e[i+1,]$x + predmod_e[i,]$x
} # for loop to calculate the running sum
Perhaps just this:
predmod_e <- predmod_e %>%
arrange(gvkey, date) %>%
group_by(gvkey) %>%
mutate(newx = cumsum(x))
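To see that cumsum restarts for each group, here is a small self-contained sketch. The toy data frame and its values are invented for illustration; only the column names gvkey, date and x follow the question.

```r
library(dplyr)

# Invented toy data: two firms (gvkey 1 and 2) with a few dated observations
toy <- tibble(
  gvkey = c(1, 1, 1, 2, 2),
  date  = as.Date("2020-01-01") + c(0, 1, 2, 0, 1),
  x     = c(10, 5, 1, 100, 50)
)

res <- toy %>%
  arrange(gvkey, date) %>%
  group_by(gvkey) %>%
  mutate(newx = cumsum(x)) %>%   # running total restarts within each gvkey
  ungroup()

res$newx
# 10 15 16 100 150
```

The explicit for loop ignores the grouping because indexing with [i, ] works on the rows of the whole table; grouped dplyr verbs such as mutate are what apply the calculation per group.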
If you want to do something with the groups yourself (i.e., not with a dplyr verb), then you should use the groups as they are "known" by the tidy verbs. Luckily, they are merely stored as an attribute:
mtcars %>%
group_by(cyl) %>%
attr(., "groups")
# # A tibble: 3 x 2
# cyl .rows
# <dbl> <list>
# 1 4 <int [11]>
# 2 6 <int [7]>
# 3 8 <int [14]>
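Instead of reading the attribute directly, dplyr also offers accessor functions for the same information. A minimal sketch, assuming a dplyr version (0.8.0 or later) that provides group_data() and group_rows():

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

group_data(grouped)          # same table as attr(grouped, "groups")
rows <- group_rows(grouped)  # list of integer row indices, one element per group
lengths(rows)                # number of rows in each group
```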

Define a quantile group in a dataframe with the data source in another dataframe in R

I have quantile information from a dataframe stored in a named vector, computed with the following code:
library(tidyverse)
quant_mpg <- mtcars %>%
pull(mpg) %>%
quantile(probs = seq(0, 1, 0.1))
And I want to use these quantiles to cut a variable in a summary dataframe created afterwards:
grouped_mtcars <- mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup() %>%
mutate(quantile = cut(mpg, quant_mpg, labels = FALSE))
Obtaining the following output:
# A tibble: 3 x 3
cyl mpg quantile
<dbl> <dbl> <int>
1 4 26.7 9
2 6 19.7 6
3 8 15.1 2
Is there a way to do this directly for the grouped variable, without defining the quant_mpg vector? I need it this way because I have several grouping variables and grouped dataframes, and I need to obtain the quantiles without much processing.
We can extract the column from the original data:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mpg = mean(mpg)) %>%
mutate(quantile = cut(mpg, quantile(mtcars[['mpg']], #####
probs = seq(0, 1, 0.1)), labels = FALSE))
# A tibble: 3 x 3
# cyl mpg quantile
# <dbl> <dbl> <int>
#1 4 26.7 9
#2 6 19.7 6
#3 8 15.1 2

How do I create a variable with mutate() that has values equal to the label of a given variable?

I am trying to use mutate() on a data.frame that I have used gather() on to create a variable whose values are the label() for the gathered variable. I have searched Google and StackOverflow and have not found a suitable answer. My research led me to think that standard evaluation might be needed.
Here is a minimal reproducible example:
# Packages
library(dplyr)
library(Hmisc)
library(tidyr)
library(lazyeval)
df <- mtcars %>%
tbl_df() %>%
slice(1)
label(df$mpg) <- "Miles per gallon"
label(df$cyl) <- "Cylinders"
df %>%
select(mpg, cyl) %>%
gather(variable, value) %>%
mutate_(.dots = interp(~attr(df$x, "label"), x = variable))
This code produces:
# A tibble: 2 × 3
variable value `attr(df$mpg, "label")`
<chr> <dbl> <chr>
1 mpg 21 Miles per gallon
2 cyl 6 Miles per gallon
which is clearly only getting the label for mpg.
My goal is to have something like:
# A tibble: 2 × 3
variable value `attr(df$variable, "label")`
<chr> <dbl> <chr>
1 mpg 21 Miles per gallon
2 cyl 6 Cylinders
What about this?
df %>%
select(mpg, cyl) %>%
gather(variable, value) %>%
mutate(labels = label(df)[variable])
# A tibble: 2 × 3
variable value labels
<chr> <dbl> <chr>
1 mpg 21 Miles per gallon
2 cyl 6 Cylinders
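The answer works because label(df) returns a character vector named by column, and indexing a named vector with the variable column looks up one label per row. The sketch below isolates that mechanism without Hmisc: label_lookup is a hand-written stand-in for label(df), and the long table is typed in by hand from the output above.

```r
library(dplyr)

# Stand-in for Hmisc::label(df): a character vector named by column
label_lookup <- c(mpg = "Miles per gallon", cyl = "Cylinders")

# Hand-written long table mirroring the gathered data in the question
long <- tibble(variable = c("mpg", "cyl"), value = c(21, 6))

res <- long %>%
  # Index the named vector by the variable column: one label per row
  mutate(labels = unname(label_lookup[variable]))

res
```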
