How can I modify the code below
xxx<-function(df,groupbys){
groupbys<-enquo(groupbys)
df%>%group_by_(groupbys)%>%summarise(count=n())
}
zzz<-xxx(iris,Species)
so that it can group by either one column or several columns? For example, grouping by both Species and Petal.Length in the iris dataset.
When using enquo (single argument) or enquos (multiple), you should use the !! and !!! operators, respectively.
xxx <- function(df, ...) {
  grps <- enquos(...)
  df %>%
    group_by(!!!grps) %>%
    tally() %>%
    ungroup()
}
mtcars %>% xxx(cyl, am)
# # A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
# 1 4 0 3
# 2 4 1 8
# 3 6 0 4
# 4 6 1 3
# 5 8 0 12
# 6 8 1 2
Or, if you want to keep a single argument in the function formals for one or more column names, I think you'll need to use vars() in the call. (Perhaps there's another way suggested in the Programming with dplyr vignette.)
xxx <- function(df, groups) {
  df %>%
    group_by(!!!groups) %>%
    tally() %>%
    ungroup()
}
xxx(mtcars, vars(cyl, am))
Here you just need to use the .dots argument of the group_by function. Just make sure groupbys is a character vector, i.e.
xxx<-function(df,groupbys){
df%>%group_by(.dots = groupbys)%>%summarise(count=n())
}
xxx(iris,"Species")
# A tibble: 3 x 2
Species count
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
xxx(iris,c("Species","Petal.Length"))
# A tibble: 48 x 3
# Groups: Species [3]
Species Petal.Length count
<fct> <dbl> <int>
1 setosa 1 1
2 setosa 1.1 1
3 setosa 1.2 2
4 setosa 1.3 7
5 setosa 1.4 13
6 setosa 1.5 13
7 setosa 1.6 7
8 setosa 1.7 4
9 setosa 1.9 2
10 versicolor 3 1
Here are two approaches to the problem. If you want to pass column names as unquoted variables, you can use ... and pass it to count instead of using group_by + summarise.
xxx<-function(df,...){
df %>% count(...)
}
xxx(mtcars, cyl)
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
xxx(mtcars, cyl, am)
# A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
#1 4 0 3
#2 4 1 8
#3 6 0 4
#4 6 1 3
#5 8 0 12
#6 8 1 2
Second approach: if you want to pass column names as quoted variables (strings), you can use group_by_at, which accepts string inputs.
xxx<-function(df,groupbys){
df %>% group_by_at(groupbys) %>% summarise(n = n())
}
xxx(mtcars, c("cyl", "am"))
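As a side note to the string-based approach above: the scoped verbs such as group_by_at are superseded in dplyr 1.0.0 and later, and across() with all_of() is the usual replacement. Here is a minimal sketch, assuming dplyr >= 1.0.0 is installed; the helper name count_by_names is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: `groupbys` is a character vector of column names.
count_by_names <- function(df, groupbys) {
  df %>%
    group_by(across(all_of(groupbys))) %>%  # select grouping columns by name
    summarise(n = n(), .groups = "drop")    # one row per group, ungrouped result
}

res <- count_by_names(mtcars, c("cyl", "am"))
res
```

all_of() makes the function fail loudly if one of the names is not a column of df, which is usually what you want inside a function.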
Related
library(tidyverse)
mean_by <- function(data,by,conti){
data %>% group_by({{by}}) %>% summarise(mean=mean({{conti}})) %>%
print() %>%
ggplot(aes(x={{by}},y=mean))+geom_col()
}
map(mtcars %>% select_if(is.numeric),~mean_by(mtcars,cyl,.))
# Not quite the same
mean_by(mtcars,cyl,carb)
I was toying around with the curly-curly operator in R (just learned about it!), and when iterating with map it seemed like the grouping isn't working properly; I can't get my head around the problem. What am I doing wrong?
By the way, when trying the explicit pmap way, I couldn't find a clever way to pass the cyl variable:
pmap(mtcars %>% select_if(is.numeric),mean_by,..1=mtcars,..2=cyl,..3=.)
Error in pmap():
i In index: 1.
Caused by error in withCallingHandlers():
! object 'cyl' not found
Run rlang::last_error() to see where the error occurred.
mean_by expects column names, not values; here, select_if returns the subset of numeric columns, so map passes the column values themselves. We need to loop over the names instead, which are strings, so it is better to convert them to symbols and unquote them with !!:
library(dplyr)
library(purrr)
mean_by <- function(data, by, conti) {
  by_sym <- rlang::ensym(by)
  conti <- rlang::ensym(conti)
  data %>%
    group_by(!!by_sym) %>%
    summarise(mean = mean(!!conti)) %>%
    print() %>%
    ggplot(aes(x = !!by_sym, y = mean)) + geom_col()
}
map(mtcars %>%
      select_if(is.numeric) %>%
      names, ~ mean_by(mtcars, cyl, !!.x))
-output (graphs removed)
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4
2 6 6
3 8 8
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 82.6
2 6 122.
3 8 209.
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4.07
2 6 3.59
3 8 3.23
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 2.29
2 6 3.12
3 8 4.00
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 19.1
2 6 18.0
3 8 16.8
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 0.909
2 6 0.571
3 8 0
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 0.727
2 6 0.429
3 8 0.143
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4.09
2 6 3.86
3 8 3.29
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 1.55
2 6 3.43
3 8 3.5
I've not seen the tilde syntax with map, but if you change that it seems to work.
map(mtcars %>% select_if(is.numeric), mean_by, data=mtcars, by=cyl)
Side note, you don't need that print() statement in mean_by.
mean_by <- function(data,by,conti){
data %>% group_by({{by}}) %>% summarise(mean=mean({{conti}})) %>%
ggplot(aes(x={{by}},y=mean))+geom_col()
}
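Another way to iterate, for what it's worth: loop over the column names and look the columns up by string, so that the mean is computed inside each group. This is only a sketch under my own assumptions: the helper name mean_by_name is made up, both by and conti are passed as strings, and the ggplot step is left out to keep it short.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: `by` and `conti` are column names given as strings.
mean_by_name <- function(data, by, conti) {
  data %>%
    group_by(across(all_of(by))) %>%                           # group by the named column
    summarise(mean = mean(.data[[conti]]), .groups = "drop")   # per-group mean of `conti`
}

# Iterate over the *names* of the numeric columns, not their values.
num_cols <- names(select(mtcars, where(is.numeric)))
res <- map(set_names(num_cols), ~ mean_by_name(mtcars, "cyl", .x))
res$mpg
```

The result is a named list with one summary tibble per numeric column, which can then be piped into a plotting step if needed.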
In R, is there a function like the IFERROR formula in Excel?
I want to calculate a moving average using the 4 nearest figures, but if a group has fewer than 4 figures, use the normal average instead.
See the code below for details; IF_ERROR is just the function I wish existed, so it doesn't work.
library(tidyverse)
library(TTR)
test_data <- data.frame(category=c('a','a','a','b','b','b','b','b','b'),
amount=c(1,2,3,4,5,6,7,8,9))
test_data %>% group_by(category) %>% mutate(avg_amount=IF_ERROR(TTR::runMedian(amount,4),
median(amount),
TTR::runMedian(amount,4)))
In general, input should only generate errors in exceptional circumstances. It can be computationally expensive to catch and handle errors where a simple if statement will suffice. The key here is realising that runMedian throws an error if the group size is less than 4. Remember we can check the group size inside mutate by using n(), so all you need do is:
test_data %>%
  group_by(category) %>%
  mutate(avg_amount = if (n() > 3) TTR::runMedian(amount, 4) else median(amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 NA
#> 5 b 5 NA
#> 6 b 6 NA
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
Additionally, if you want to replace the NA values from the beginning of the running median, you could use ifelse:
test_data %>%
  group_by(category) %>%
  mutate(avg_amount = if (n() > 3) TTR::runMedian(amount, 4) else median(amount),
         avg_amount = ifelse(is.na(avg_amount), median(amount), avg_amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 6.5
#> 5 b 5 6.5
#> 6 b 6 6.5
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
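To answer the literal question as well: base R has no function called IFERROR, but tryCatch can play that role when an error really cannot be avoided up front. A minimal sketch follows; the name if_error is made up for illustration and is not part of any package.

```r
# Hypothetical helper mimicking Excel's IFERROR: evaluate `expr` and
# return `fallback` only if evaluating `expr` signals an error.
if_error <- function(expr, fallback) {
  tryCatch(expr, error = function(e) fallback)
}

ok  <- if_error(1 + 1, 0)          # no error, so the value of the expression is returned
bad <- if_error(stop("boom"), 0)   # the error is caught, so the fallback is returned
```

Because R evaluates arguments lazily, fallback is only computed when an error actually occurs. Inside a grouped mutate, the explicit n() check shown above is still the cheaper and clearer option.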
I'm computing the frequency by group with dplyr, but the output is not automatically saved as a data frame and only the first 10 rows are shown. Does anyone know how to fix that? I need to use all rows of data for further analyses. Thanks!
library(dplyr)
data01 %>%
group_by(Country, relsta) %>%
summarize(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
Output
Country relsta Freq married
<int> <chr> <int> <dbl>
1 1 1 15 0.176
2 1 3 1 0.0118
3 1 4 28 0.329
4 1 5 6 0.0706
5 1 6 22 0.259
6 1 7 1 0.0118
7 1 99 12 0.141
8 2 NA 273 1
9 3 NA 129 1
10 4 2 9 0.0796
# ... with 115 more rows
dplyr returns tibbles; all rows are there, the printed output just hides most of them from you. Here is an example using iris:
library(dplyr)
res1 <- iris %>%
group_by(Sepal.Length, Species) %>%
summarize(Freq=n()) %>%
mutate(foo = Freq/sum(Freq))
res1
# Sepal.Length Species Freq foo
# <dbl> <fct> <int> <dbl>
# 1 4.3 setosa 1 1
# 2 4.4 setosa 3 1
# 3 4.5 setosa 1 1
# 4 4.6 setosa 4 1
# 5 4.7 setosa 2 1
# 6 4.8 setosa 5 1
# 7 4.9 setosa 4 0.667
# 8 4.9 versicolor 1 0.167
# 9 4.9 virginica 1 0.167
# 10 5 setosa 8 0.8
# # … with 47 more rows
Notice the … with 47 more rows. You may also check the dimensions:
dim(res1)
# [1] 57 4
Also,
class(res1)
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
whereas:
class(iris)
# [1] "data.frame"
To see more data, use as.data.frame(). If the data is too large, rows still get omitted. You can adjust that limit with e.g. options(max.print=3000); the default is 99999 in base R, while RStudio sets it to 1000.
as.data.frame(res1)
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 55 7.6 virginica 1 1.0000000
# 56 7.7 virginica 4 1.0000000
# 57 7.9 virginica 1 1.0000000
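Two more things that may help here, sketched on the same iris example: the result has to be assigned to a name to be kept, and a tibble can be printed in full with print(n = Inf) without converting it. The name res_df is just for illustration.

```r
library(dplyr)

# Assign the result so it is kept for further analysis.
res1 <- iris %>%
  group_by(Sepal.Length, Species) %>%
  summarise(Freq = n(), .groups = "drop_last") %>%  # still grouped by Sepal.Length
  mutate(foo = Freq / sum(Freq))

print(res1, n = Inf)  # print every row of the tibble

# Drop the grouping and convert to a plain data frame if needed.
res_df <- res1 %>% ungroup() %>% as.data.frame()
```

All rows are available in res1 either way; only the printing is truncated.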
You could also consider using base R. Since the following line already gives you the "Freq" column,
as.data.frame.table(with(iris, table(Sepal.Length, Species)))
you could do this:
res2 <- with(iris, table(Sepal.Length, Species)) |>
as.data.frame.table() |>
transform(foo=ave(Freq, Sepal.Length, FUN=\(x) x/sum(x))) |>
subset(Freq > 0)
res2
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 103 7.6 virginica 1 1.0000000
# 104 7.7 virginica 4 1.0000000
# 105 7.9 virginica 1 1.0000000
Where:
dim(res2)
# [1] 57 4
class(res2)
# [1] "data.frame"
Note: this requires R >= 4.1 (for the native pipe |> and the \(x) lambda shorthand).
The summarize function always returns just one row per group. mutate will keep all the rows here. Try:
library(dplyr)
data02 = data01 %>%
group_by(Country, relsta) %>%
mutate(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
I'm trying to get the total number of entries of each row in a dataframe in order to compress on these fields later.
However the dataframe has over 60 rows and writing the below 60 times is extremely inefficient
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
Is there a way I can write a for loop to loop through all the names in the dataframe and produce the pipe function result for each? I tried
for (i in colnames(df)) {
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
}
But I'm getting an 'i is unknown' error. Any help would be appreciated thanks.
If I understand correctly, you want to count the number of occurrences of each unique element in every single column, or did I get that completely wrong? Why not just use a combination of an apply function and table?
set.seed(101)
df <- data.frame("x" = 1:20, "y" = LETTERS[sample(1:26, 20, replace = TRUE)], "z" = letters[sample(1:26, 20, replace = TRUE)])
l <- lapply(df, table)  # lapply always returns a list, one table per column
lapply(l, sort, decreasing = TRUE)
You can try this:
#Data
df <- iris
#Create list
List <- list()
#Compute
for (colname in colnames(df)) {
List[[colname]]<- df %>%
group_by(df[,colname]) %>%
count() %>%
arrange(desc(n))
}
#Print
List
$Sepal.Length
# A tibble: 35 x 2
# Groups: df[, colname] [35]
`df[, colname]` n
<dbl> <int>
1 5 10
2 5.1 9
3 6.3 9
4 5.7 8
5 6.7 8
6 5.5 7
7 5.8 7
8 6.4 7
9 4.9 6
10 5.4 6
# ... with 25 more rows
$Sepal.Width
# A tibble: 23 x 2
# Groups: df[, colname] [23]
`df[, colname]` n
<dbl> <int>
1 3 26
2 2.8 14
3 3.2 13
4 3.4 12
5 3.1 11
6 2.9 10
7 2.7 9
8 2.5 8
9 3.3 6
10 3.5 6
# ... with 13 more rows
$Petal.Length
# A tibble: 43 x 2
# Groups: df[, colname] [43]
`df[, colname]` n
<dbl> <int>
1 1.4 13
2 1.5 13
3 4.5 8
4 5.1 8
5 1.3 7
6 1.6 7
7 5.6 6
8 4 5
9 4.7 5
10 4.9 5
# ... with 33 more rows
$Petal.Width
# A tibble: 22 x 2
# Groups: df[, colname] [22]
`df[, colname]` n
<dbl> <int>
1 0.2 29
2 1.3 13
3 1.5 12
4 1.8 12
5 1.4 8
6 2.3 8
7 0.3 7
8 0.4 7
9 1 7
10 2 6
# ... with 12 more rows
$Species
# A tibble: 3 x 2
# Groups: df[, colname] [3]
`df[, colname]` n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
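A variation on the loop above, in case the `df[, colname]` header in the output is a nuisance: selecting the column by name with across(all_of()) keeps the real column name. This is a sketch assuming dplyr >= 1.0.0; the name counts is just for illustration.

```r
library(dplyr)

df <- iris

# One count table per column, sorted by descending frequency.
counts <- lapply(
  setNames(names(df), names(df)),  # named vector, so the result list is named too
  function(nm) count(df, across(all_of(nm)), sort = TRUE)
)

counts$Species
```

Each list element is an ungrouped tibble with the original column name plus n.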
I want to create a function that takes a grouping argument. Which can be a single or multiple variables. I want it to look like this:
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
This works only when a single group is given but breaks when there are multiple groups. I know it's possible to use the following with the ellipsis ... (but I want the syntax groups = something):
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
Here is the entire code:
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
# works
wanted(iris, groups = Species )
not_wanted(iris, Species, group2)
# doesn't work
wanted(iris, groups = vars(Species, group2) )
wanted(iris, groups = c(Species, group2) )
wanted(iris, groups = vars("Species", "group2") )
# Error: Column `vars(Species, group2)` must be length 150 (the number of rows) or one, not 2
You guys are over complicating things, this works just fine:
library(tidyverse)
wanted <- function(data, groups){
data %>% count(!!!groups)
}
mtcars %>% wanted(groups = vars(mpg,disp,hp))
# A tibble: 31 x 4
mpg disp hp n
<dbl> <dbl> <dbl> <int>
1 10.4 460 215 1
2 10.4 472 205 1
3 13.3 350 245 1
4 14.3 360 245 1
5 14.7 440 230 1
6 15 301 335 1
7 15.2 276. 180 1
8 15.2 304 150 1
9 15.5 318 150 1
10 15.8 351 264 1
# … with 21 more rows
The triple bang operator and parse_quos from the rlang package will do the trick. For more info, see e.g. https://stackoverflow.com/a/49941635/6086135
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
vec <- c("Species", "group2")
wanted <- function(data, groups){
data %>% count(!!!rlang::parse_quos(groups, rlang::current_env()))
}
wanted(iris, vec)
#> # A tibble: 15 x 3
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
Created on 2020-01-06 by the reprex package (v0.3.0)
Here is another option to avoid quotation marks in the function call. I admit it's not very pretty, though.
library(tidyverse)
wanted <- function(data, groups){
grouping <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
data %>% group_by_at(grouping) %>% count()
}
iris$group2 <- rep(1:5, 30)
wanted(iris, groups = c(Species, group2) )
#> # A tibble: 15 x 3
#> # Groups: Species, group2 [15]
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
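For completeness, with dplyr >= 1.0.0 the groups = c(Species, group2) syntax can also be handled by wrapping the embraced argument in across(), without any string manipulation. A minimal sketch under that assumption; iris2 is just a local copy so that the built-in iris is left untouched.

```r
library(dplyr)

wanted <- function(data, groups) {
  data %>%
    group_by(across({{ groups }})) %>%    # accepts one column or c(col1, col2, ...)
    summarise(n = n(), .groups = "drop")
}

iris2 <- iris
iris2$group2 <- rep(1:5, 30)

res_one <- wanted(iris2, groups = Species)
res_two <- wanted(iris2, groups = c(Species, group2))
```

Since across() uses tidyselect, helpers such as starts_with() should work for groups as well.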