I want to run this for loop over each column name in my data frame, but group_by throws an error:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
My code:
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by(i) %>%
summarise(value = n()) %>%
select(label = i, value)
print(distribution)
}
How can I fix this error?
Thanks for your help.
Here is a tidier alternative that builds a frequency table for each column and binds them into a single data frame.
library(dplyr)
library(purrr)
mtcars %>%
map(~table(.x)) %>%
lapply(as_tibble) %>%
bind_rows(.id = "var")
# # A tibble: 171 x 3
# var .x n
# <chr> <chr> <int>
# 1 mpg 10.4 2
# 2 mpg 13.3 1
# 3 mpg 14.3 1
# 4 mpg 14.7 1
# 5 mpg 15 1
# 6 mpg 15.2 2
# 7 mpg 15.5 1
# 8 mpg 15.8 1
# 9 mpg 16.4 1
# 10 mpg 17.3 1
# # ... with 161 more rows
If I'm understanding your code correctly, you want to count the unique values in each column of your data frame and print each table to the console:
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by_at(.vars = i) %>%
summarise(value = n())
print(distribution)
}
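As a side note, group_by_at() is superseded in current dplyr releases. Below is a minimal sketch of the same loop using the .data pronoun to look each column up by its string name. mtcars stands in for creditDF here, which is an assumption, since the original data isn't shown.

```r
library(dplyr)

# Assumption: the built-in mtcars data set stands in for creditDF.
creditDF <- mtcars

for (i in colnames(creditDF)) {
  distribution <- creditDF %>%
    # .data[[i]] looks the column up by its name stored in the string i;
    # naming the expression gives the "label" column the question asked for.
    group_by(label = .data[[i]]) %>%
    summarise(value = n(), .groups = "drop")
  print(distribution)
}
```

Each printed table has a label column with the distinct values and a value column with their counts.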
Solution with base R.
for(i in creditDF) print(as.data.frame(table(i)))
Using mtcars as an example, I would like to write a function that creates a count column and a pct column, as below:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(cyl_pct = count/sum(count))
This produces the output:
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
However, I would like to create a function where I can specify any column as the group_by column, and the mutate column will be named after the column given to group_by, plus a _pct suffix. So if I want to use disp, disp will be my group_by variable and the function will mutate a disp_pct column.
Similar to akrun's answer, but using {{ instead of !!:
foo = function(data, col) {
data %>%
group_by({{col}}) %>%
summarize(count = n()) %>%
ungroup %>%
mutate(
"{{col}}_pct" := count / sum(count)
)
}
foo(mtcars, cyl)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
# 1 4 11 0.344
# 2 6 7 0.219
# 3 8 14 0.438
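To connect this back to the disp case from the question, here is a short sketch that calls the same kind of function with a different column. The function is repeated so the snippet runs on its own; the .groups = "drop" argument is my addition to silence the regrouping message.

```r
library(dplyr)

# Same idea as foo() above, repeated so this snippet is self-contained.
foo <- function(data, col) {
  data %>%
    group_by({{ col }}) %>%
    summarise(count = n(), .groups = "drop") %>%
    # The glue-style string builds the new column name from the argument.
    mutate("{{col}}_pct" := count / sum(count))
}

out <- foo(mtcars, disp)
names(out)
```

The third column should come out as disp_pct, and its values sum to 1.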
Assuming that the input is unquoted, convert it to a symbol with ensym and evaluate it (!!) within group_by. To build the new column name, convert the symbol to a string (as_string) and paste the suffix '_pct' onto it. In mutate we can use := along with !! to assign the column name from the object created ('colnm').
library(stringr)
library(dplyr)
f1 <- function(dat, grp) {
grp <- ensym(grp)
colnm <- str_c(rlang::as_string(grp), '_pct')
dat %>%
group_by(!!grp) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(!! colnm := count/sum(count))
}
Testing:
f1(mtcars, cyl)
# A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
#1 4 11 0.344
#2 6 7 0.219
#3 8 14 0.438
This is probably no different from the answer posted by my dear friend @akrun. However, my version uses enquo instead of ensym.
There is a subtle difference between the two that you might find interesting:
Per the nse-defuse documentation, ensym returns a raw expression, whereas enquo returns a "quosure", which is a "wrapper containing an expression and an environment". So we need one extra step to access the expression inside the quosure created by enquo.
Here we use get_expr for that purpose. This is just another way of writing the function, which may be of interest to whoever reads this post in the future.
library(dplyr)
library(rlang)
fn <- function(data, Var) {
Var <- enquo(Var)
colnm <- paste(get_expr(Var), "pct", sep = "_")
data %>%
group_by(!!Var) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(!! colnm := count/sum(count))
}
fn(mtcars, cyl)
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
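The ensym/enquo difference described above can be checked directly. This is a small sketch; show_sym and show_quo are hypothetical helper names used only for illustration.

```r
library(rlang)

# Hypothetical helpers that just return what each defusing function captures.
show_sym <- function(x) ensym(x)
show_quo <- function(x) enquo(x)

s <- show_sym(cyl)
q <- show_quo(cyl)

is_symbol(s)             # ensym() gives a bare symbol
is_quosure(q)            # enquo() gives a quosure (expression + environment)
is_symbol(get_expr(q))   # get_expr() unwraps the quosure back to the symbol
as_name(q)               # as_name() accepts a quosured symbol and returns its name
```

In practice this means as_name() (or as_label()) can stand in for the get_expr() step when all you need is the column name as a string.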
Can someone help me, please?
I have columns A, B and C. I want to get the top value of column C, grouped by A, but also keep the value of B for those top rows.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique value of A, so I don't know which B value belongs to it. (Important: group_by(A, B) doesn't work, because it doesn't give the top value for each unique A; it returns the same as the original data X.)
This can be achieved with dplyr::top_n() or its successor dplyr::slice_max(), like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
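One caveat: by default slice_max() keeps every row that ties for the maximum, so a group can return more than one row. A sketch that forces exactly one row per group, using mtcars in place of X (an assumption, since X isn't shown):

```r
library(dplyr)

top_rows <- mtcars %>%
  select(cyl, mpg, hp) %>%
  group_by(cyl) %>%
  # with_ties = FALSE keeps a single row even if several share the max hp
  slice_max(hp, n = 1, with_ties = FALSE) %>%
  ungroup()

top_rows
```

Which of the tied rows survives is simply the first one encountered, so add a tie-breaking column to the ordering if that matters.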
The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can, for instance, keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.
Create a list-column for each group with group_nest(), add a column holding the number of samples you want from each group, then map those two columns to sample_n():
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
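An alternative sketch of the same idea without nesting: split the data by group, then sample each piece with its own size. The n_per_group lookup vector is a hypothetical name introduced here, and slice_sample() is used because sample_n() is superseded in current dplyr.

```r
library(dplyr)
library(purrr)

set.seed(42)  # only to make the sketch reproducible

# Hypothetical lookup: group value (as a string) -> number of rows to keep
n_per_group <- c("0" = 2, "1" = 3)

df <- select(mtcars, vs, drat)

sampled <- split(df, df$vs) %>%
  # imap() passes each piece as .x and its name ("0" or "1") as .y
  imap(~ slice_sample(.x, n = n_per_group[[.y]])) %>%
  bind_rows()

count(sampled, vs)
```

Keying the sample sizes by group value rather than by position avoids silently mismatching sizes if the group order changes.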
I want to use filter or a similar function inside summarise from the dplyr package. I've got a data frame (e.g. mtcars) where I need to group by a factor (e.g. cyl) and then calculate some statistics and the percentage of total wt for every cyl type (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get a percentage that excludes the last 10 rows?
I've tried this code, but it returns NA:
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short is the % of sum(wt) for every cyl in the shorter data frame mtcars[1:22,]
Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that once you assign wt in the call to summarise, all subsequent references to wt use the newly assigned wt, not the original column. Because the new wt is a single value per group, wt[1:22] is mostly NA, and so is its sum. You can see the same effect here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
I think you can do it like this. First we calculate the row number within each group. If max(ID) > 10, the group has enough observations to drop the last 10 rows, so we keep only rows with ID < max(ID) - 9. Otherwise ID == ID is always TRUE and nothing is removed.
mtcars %>% group_by(cyl) %>%
mutate(ID = row_number()) %>%
filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)
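If the goal is instead to drop the last 10 rows of the whole data frame (which is how the question computes mtcars[1:22, ]), a simpler sketch is to subset first and summarise afterwards. The name short is introduced here just for illustration.

```r
library(dplyr)

# Drop the last 10 rows of the whole data frame before grouping.
short <- head(mtcars, nrow(mtcars) - 10)

res <- short %>%
  group_by(cyl) %>%
  # Share of the shortened total weight that falls in each cyl group
  summarise(wt.pc.short = sum(wt) * 100 / sum(short$wt))

res
```

This is the same computation as wt.pc.short in the question, just without indexing inside summarise, so the three percentages sum to 100.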
I would like to use invoke_map to call a list of functions. I have a set of variable names that I would like to use as arguments to each of the functions. Ultimately the variable names will be used with group_by.
Here's an example:
library(dplyr)
library(purrr)
first_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
count()
}
second_fun <- function(...){
by_group = quos(...)
mtcars %>%
group_by(!!!by_group) %>%
summarise(avg_wt = mean(wt))
}
first_fun(mpg, cyl) # works
second_fun(mpg, cyl) # works
both_funs <- list(first_fun, second_fun)
both_funs %>%
invoke_map(mpg, cyl) # What do I do here?
I have tried various attempts to put the variable names in quotes, enquo them, use vars, reference .data$mpg, etc, but I am stabbing in the dark a bit.
The issue is not that you're using dots; it's that you're passing bare names, and these arguments are evaluated when map2_impl is called.
Try this and explore the environment:
debugonce(map2)
both_funs %>% invoke_map("mpg", "cyl")
This, on the other hand, works:
first_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
count()
}
second_fun2 <- function(...){
mtcars %>%
{do.call(group_by_,list(.,unlist(list(...))))} %>%
summarise(avg_wt = mean(wt))
}
both_funs2 <- list(first_fun2, second_fun2)
both_funs2 %>% invoke_map("mpg", "cyl")
# [[1]]
# # A tibble: 25 x 2
# # Groups: mpg [25]
# mpg n
# <dbl> <int>
# 1 10.4 2
# 2 13.3 1
# 3 14.3 1
# 4 14.7 1
# 5 15.0 1
# 6 15.2 2
# 7 15.5 1
# 8 15.8 1
# 9 16.4 1
# 10 17.3 1
# # ... with 15 more rows
#
# [[2]]
# # A tibble: 25 x 2
# mpg avg_wt
# <dbl> <dbl>
# 1 10.4 5.3370
# 2 13.3 3.8400
# 3 14.3 3.5700
# 4 14.7 5.3450
# 5 15.0 3.5700
# 6 15.2 3.6075
# 7 15.5 3.5200
# 8 15.8 3.1700
# 9 16.4 4.0700
# 10 17.3 3.7300
# # ... with 15 more rows
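Both invoke_map() and group_by_() are deprecated in recent purrr and dplyr releases, so here is a hedged sketch that keeps the original quos()-based functions and simply calls each one with bare column names inside an anonymous function. Because the names are written inside the function body, they are captured by quos() rather than evaluated up front.

```r
library(dplyr)

# The two functions from the question, unchanged apart from .groups = "drop".
first_fun <- function(...) {
  by_group <- quos(...)
  mtcars %>%
    group_by(!!!by_group) %>%
    count()
}

second_fun <- function(...) {
  by_group <- quos(...)
  mtcars %>%
    group_by(!!!by_group) %>%
    summarise(avg_wt = mean(wt), .groups = "drop")
}

both_funs <- list(first_fun, second_fun)

# Call every function in the list with the same bare column names.
results <- lapply(both_funs, function(f) f(mpg, cyl))
```

results is a list with one tibble per function, both grouped by mpg and cyl; purrr::map() with the same anonymous function would behave the same way.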