For loop with dplyr package - r

I want to run this for loop over each column name in my data frame, but I get an error from group_by:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
My code :
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by(i) %>%
summarise(value = n()) %>%
select(label = i, value)
print(distribution)
}
How can I fix this error?
Thanks for your help.

Here is a tidier alternative that creates a frequency table for each column and binds them into a single data frame.
library(dplyr)
library(purrr)
mtcars %>%
map(~table(.x)) %>%
lapply(as_tibble) %>%
bind_rows(.id = "var")
# # A tibble: 171 x 3
# var .x n
# <chr> <chr> <int>
# 1 mpg 10.4 2
# 2 mpg 13.3 1
# 3 mpg 14.3 1
# 4 mpg 14.7 1
# 5 mpg 15 1
# 6 mpg 15.2 2
# 7 mpg 15.5 1
# 8 mpg 15.8 1
# 9 mpg 16.4 1
# 10 mpg 17.3 1
# # ... with 161 more rows

If I'm understanding your code correctly, you want to find the unique items in each column of your data frame and print the table to the console:
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by_at(.vars = i) %>%
summarise(value = n())
print(distribution)
}
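For readers on dplyr 1.0 or later, where the scoped verbs such as group_by_at are superseded by across(), the same loop can be sketched as below. The small creditDF here is a hypothetical stand-in (the real data is not shown in the question), and the rename() step reproduces the `label` column the question asked for.

```r
library(dplyr)

# Hypothetical stand-in for creditDF, which is not shown in the question
creditDF <- data.frame(
  housing = c("own", "rent", "own", "own"),
  job     = c("skilled", "skilled", "unskilled", "skilled")
)

distributions <- list()
for (i in colnames(creditDF)) {
  distributions[[i]] <- creditDF %>%
    group_by(across(all_of(i))) %>%            # i holds a column name as a string
    summarise(value = n(), .groups = "drop") %>%
    rename(label = all_of(i))                  # rename the grouping column to "label"
  print(distributions[[i]])
}
```

Storing each result in a list keeps the tables available after the loop instead of only printing them.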

Solution with base R.
for(i in creditDF) print(as.data.frame(table(i)))
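The one-liner above prints each table but drops the column name, because the loop iterates over the column values. A minimal sketch that keeps the names, again using a hypothetical stand-in for creditDF, could use lapply over the data frame:

```r
# Hypothetical stand-in for creditDF, which is not shown in the question
creditDF <- data.frame(
  housing = c("own", "rent", "own", "own"),
  job     = c("skilled", "skilled", "unskilled", "skilled")
)

# One frequency table per column; the list is named after the columns
freq_tables <- lapply(creditDF, function(col) {
  as.data.frame(table(label = col), responseName = "value")
})

print(freq_tables)
```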

Related

How do I create a function to mutate new columns with a variable name and "_pct"?

Using mtcars as an example. I would like to write a function that creates a count and pct column such as below -
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(cyl_pct = count/sum(count))
This produces the output -
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
However, I would like to create a function where I can specify any column as the group_by column, and the mutated column will be named after that column with a _pct suffix. So if I want to use disp, disp will be my group_by variable and the function will mutate a disp_pct column.
Similar to akrun's answer, but using {{ instead of !!:
foo = function(data, col) {
data %>%
group_by({{col}}) %>%
summarize(count = n()) %>%
ungroup %>%
mutate(
"{{col}}_pct" := count / sum(count)
)
}
foo(mtcars, cyl)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
# 1 4 11 0.344
# 2 6 7 0.219
# 3 8 14 0.438
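If the column name arrives as a string rather than a bare name (for example when looping over names(mtcars)), a variant along these lines may help. It is only a sketch: pct_by_name is a hypothetical helper name, and it assumes dplyr 1.0 or later with an rlang version that supports glue-style names on the left of :=.

```r
library(dplyr)

# Hypothetical helper: `col` is a column name given as a string
pct_by_name <- function(data, col) {
  data %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%
    mutate("{col}_pct" := count / sum(count))  # single braces: col is an ordinary string
}

res <- pct_by_name(mtcars, "disp")
print(res)
```

Single braces are used here because col is a regular variable holding a string; the double-brace form is for function arguments that capture a bare column name.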
Assuming that the input is unquoted, convert it to a symbol with ensym, evaluate it (!!) within group_by, and separately convert the symbol into a string (as_string) and paste on the suffix '_pct' to build the new column name. In mutate we can use := along with !! to assign the column name from the object created ('colnm').
library(stringr)
library(dplyr)
f1 <- function(dat, grp) {
grp <- ensym(grp)
colnm <- str_c(rlang::as_string(grp), '_pct')
dat %>%
group_by(!!grp) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(!! colnm := count/sum(count))
}
Testing:
f1(mtcars, cyl)
# A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
#1 4 11 0.344
#2 6 7 0.219
#3 8 14 0.438
This is probably no different from the one posted by my dear friend @akrun. However, in my version I used the enquo function instead of ensym.
There is actually a subtle difference between the two and I thought you might be interested to know:
As per documentation of nse-defuse, ensym returns a raw expression whereas enquo returns a "quosure" which is in fact a "wrapper containing an expression and an environment". So we need one extra step to access the expression of quosure made by enquo.
In this case we use get_expr for that purpose. So here is just another way of writing this function, which I thought might be of interest to whoever reads this post in the future.
library(dplyr)
library(rlang)
fn <- function(data, Var) {
Var <- enquo(Var)
colnm <- paste(get_expr(Var), "pct", sep = "_")
data %>%
group_by(!!Var) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(!! colnm := count/sum(count))
}
fn(mtcars, cyl)
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
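The ensym/enquo difference described in this answer can be checked directly. The two wrapper functions below are hypothetical helpers written only for this illustration:

```r
library(rlang)

# Hypothetical helpers that only capture their argument
capture_sym <- function(x) ensym(x)
capture_quo <- function(x) enquo(x)

s <- capture_sym(cyl)
q <- capture_quo(cyl)

is_symbol(s)             # ensym returns a bare symbol
is_quosure(q)            # enquo returns a quosure (expression + environment)
is_symbol(get_expr(q))   # get_expr extracts the expression from the quosure

as_string(s)             # symbol to string
as_label(q)              # quosure to a label string

# enquo also accepts full expressions; ensym only accepts a symbol or string
q2 <- capture_quo(cyl * 2)
ensym_on_call <- try(capture_sym(cyl * 2), silent = TRUE)
inherits(ensym_on_call, "try-error")
```

So ensym is the stricter choice when the argument must be a single column name, while enquo also allows computed expressions.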

Add Another Column Info to Results of groupby r

Can someone help me please?
I have Column A, Column B and Column C. I want to get the top value of column C, grouped by A, but also keep the information in B for those top values.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique A, so I don't know which B value goes with it. (Important: group_by(A,B) doesn't work, because it doesn't give the top value for each unique A; it returns the same rows as the data frame X.)
This could be achieved via dplyr::top_n or dplyr::slice_max like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
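One detail worth knowing: by default slice_max keeps all rows that tie for the maximum, so a group can return more than one row. A small sketch with a hypothetical X (the real data is not shown in the question) illustrates the with_ties argument:

```r
library(dplyr)

# Hypothetical stand-in for X; group "a" has two rows tied on C
X <- tibble(
  A = c("a", "a", "b", "b"),
  B = c("p", "q", "r", "s"),
  C = c(5, 5, 1, 3)
)

# Default: ties are kept, so group "a" contributes two rows
with_ties <- X %>% group_by(A) %>% slice_max(C, n = 1) %>% ungroup()

# with_ties = FALSE returns exactly one row per group
one_each <- X %>% group_by(A) %>% slice_max(C, n = 1, with_ties = FALSE) %>% ungroup()

print(with_ties)
print(one_each)
```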

R - keep random rows per group, but different numbers per group

The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can for instance keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.
Create a list-column for each group using group_nest(), add a column with the number of samples you want in each group, then map these two columns to the sample_n() function:
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
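An alternative sketch that avoids the nesting step uses group_modify() with a named lookup vector. The name n_per_group is an assumption made for this example, and slice_sample() is used because sample_n() is superseded in dplyr 1.0 and later.

```r
library(dplyr)

# Hypothetical lookup: number of rows to keep for each value of vs
n_per_group <- c("0" = 2, "1" = 3)

set.seed(1)  # only to make the random draw repeatable
sampled <- mtcars %>%
  select(vs, drat) %>%
  group_by(vs) %>%
  # .x is the data for one group, .y is a one-row tibble holding the group key
  group_modify(~ slice_sample(.x, n = n_per_group[[as.character(.y$vs)]])) %>%
  ungroup()

print(sampled)
```

Looking the sample size up by group key, rather than by position, keeps the result correct even if the groups come back in a different order.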

filter inside dplyr's summarise

I want to use filter or a similar function inside summarise from the dplyr package. I have a data frame (e.g. mtcars) that I need to group by a factor (e.g. cyl), then calculate some statistics and the percentage of total wt for every cyl type (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get a percentage that excludes the last 10 rows?
I've tried this code but it returns NA :(
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short: % of sum(wt) for every cyl in the shorter data frame mtcars[1:22,]
Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that when you assign wt in the call to summarize, all subsequent references to wt will take the previously assigned wt, not the original wt. A statement such as wt[1:22] is thus somewhat problematic. You can see this here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
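Because summarise evaluates its expressions in order, another way to sidestep the shadowing is to keep the name wt but compute everything that needs the original column first, and overwrite wt last. This is a sketch of that ordering, not the answerer's code:

```r
library(dplyr)

res <- mtcars %>%
  mutate(id = row_number()) %>%   # row position in the full data frame
  group_by(cyl) %>%
  summarise(
    # these two still see the original wt column
    wt.pc.short = sum(wt[id <= 22]) * 100 / sum(mtcars$wt[1:22]),
    wt.pc       = sum(wt) * 100 / sum(mtcars$wt),
    # wt is overwritten only after it is no longer needed
    wt          = round(sum(wt)),
    drat.max    = round(max(drat)),
    .groups     = "drop"
  )

print(res)
```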
I think you can do it like this. First, calculate the row number within each group. If max(ID) > 10, the group has enough observations to drop its last 10 rows, so we keep only rows with ID < max(ID) - 9. Otherwise ID == ID is TRUE for every row and nothing is removed.
mtcars %>% group_by(cyl) %>%
mutate(ID = row_number()) %>%
filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)

use invoke_map to pass variable names as args

I would like to use invoke_map to call a list of functions. I have a set of variable names that I would like to use as arguments to each of the functions. Ultimately the variable names will be used with group_by.
Here's an example:
library(dplyr)
library(purrr)
first_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
count()
}
second_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
summarise(avg_wt = mean(wt))
}
first_fun(mpg, cyl) # works
second_fun(mpg, cyl) # works
both_funs <- list(first_fun, second_fun)
both_funs %>%
invoke_map(mpg, cyl) # What do I do here?
I have tried various attempts to put the variable names in quotes, enquo them, use vars, reference .data$mpg, etc, but I am stabbing in the dark a bit.
The issue is not that you're using dots; it's that you're passing bare names, and these arguments are evaluated when map2_impl is called.
Try this and explore the environment:
debugonce(map2)
both_funs %>% invoke_map("mpg", "cyl")
This works on the other hand:
first_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
count()
}
second_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
summarise(avg_wt = mean(wt))
}
both_funs2 <- list(first_fun2, second_fun2)
both_funs2 %>% invoke_map("mpg", "cyl")
# [[1]]
# # A tibble: 25 x 2
# # Groups: mpg [25]
# mpg n
# <dbl> <int>
# 1 10.4 2
# 2 13.3 1
# 3 14.3 1
# 4 14.7 1
# 5 15.0 1
# 6 15.2 2
# 7 15.5 1
# 8 15.8 1
# 9 16.4 1
# 10 17.3 1
# # ... with 15 more rows
#
# [[2]]
# # A tibble: 25 x 2
# mpg avg_wt
# <dbl> <dbl>
# 1 10.4 5.3370
# 2 13.3 3.8400
# 3 14.3 3.5700
# 4 14.7 5.3450
# 5 15.0 3.5700
# 6 15.2 3.6075
# 7 15.5 3.5200
# 8 15.8 3.1700
# 9 16.4 4.0700
# 10 17.3 3.7300
# # ... with 15 more rows
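Note that the output above is grouped by mpg only. Since group_by_ and invoke_map are both deprecated in current dplyr and purrr releases, a sketch with the current API might look like the following. The names first_fun3, second_fun3 and both_funs3 are hypothetical, and the functions take column names as strings:

```r
library(dplyr)
library(purrr)

# Hypothetical rewrites: column names are passed as strings through ...
first_fun3 <- function(...) {
  cols <- c(...)
  mtcars %>%
    group_by(across(all_of(cols))) %>%
    count()
}

second_fun3 <- function(...) {
  cols <- c(...)
  mtcars %>%
    group_by(across(all_of(cols))) %>%
    summarise(avg_wt = mean(wt), .groups = "drop")
}

both_funs3 <- list(first_fun3, second_fun3)

# Call every function in the list with the same arguments
results <- map(both_funs3, function(f) f("mpg", "cyl"))
print(results)
```

Here both results are grouped by the mpg and cyl combination, which is what the question's first_fun(mpg, cyl) call returns.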
