For loop to iterate through dplyr pipe - r

I'm trying to count how often each value occurs in every column of a dataframe, in order to compress on these fields later.
However, the dataframe has over 60 columns, and writing the code below 60 times is extremely inefficient:
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
Is there a way I can write a for loop to loop through all the names in the dataframe and produce the pipe function result for each? I tried
for (i in colnames(df)) {
df %>%
group_by(colname) %>%
count() %>%
arrange(desc(n))
}
But I'm getting an 'i is unknown' error. Any help would be appreciated thanks.

If I understand correctly, you want to count the occurrences of each unique element in every single column. In that case, why not just combine an apply function with table?
set.seed(101)
df <- data.frame("x" = 1:20, "y" = LETTERS[sample(1:26, 20, replace = TRUE)], "z" = letters[sample(1:26, 20, replace = TRUE)])
l <- lapply(df, table)
lapply(l, sort, decreasing = TRUE)

You can try this:
#Data
df <- iris
#Create list
List <- list()
#Compute
for (colname in colnames(df)) {
List[[colname]]<- df %>%
group_by(df[,colname]) %>%
count() %>%
arrange(desc(n))
}
#Print
List
$Sepal.Length
# A tibble: 35 x 2
# Groups: df[, colname] [35]
`df[, colname]` n
<dbl> <int>
1 5 10
2 5.1 9
3 6.3 9
4 5.7 8
5 6.7 8
6 5.5 7
7 5.8 7
8 6.4 7
9 4.9 6
10 5.4 6
# ... with 25 more rows
$Sepal.Width
# A tibble: 23 x 2
# Groups: df[, colname] [23]
`df[, colname]` n
<dbl> <int>
1 3 26
2 2.8 14
3 3.2 13
4 3.4 12
5 3.1 11
6 2.9 10
7 2.7 9
8 2.5 8
9 3.3 6
10 3.5 6
# ... with 13 more rows
$Petal.Length
# A tibble: 43 x 2
# Groups: df[, colname] [43]
`df[, colname]` n
<dbl> <int>
1 1.4 13
2 1.5 13
3 4.5 8
4 5.1 8
5 1.3 7
6 1.6 7
7 5.6 6
8 4 5
9 4.7 5
10 4.9 5
# ... with 33 more rows
$Petal.Width
# A tibble: 22 x 2
# Groups: df[, colname] [22]
`df[, colname]` n
<dbl> <int>
1 0.2 29
2 1.3 13
3 1.5 12
4 1.8 12
5 1.4 8
6 2.3 8
7 0.3 7
8 0.4 7
9 1 7
10 2 6
# ... with 12 more rows
$Species
# A tibble: 3 x 2
# Groups: df[, colname] [3]
`df[, colname]` n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
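A variant of the loop above, as a minimal sketch: it assumes dplyr is loaded and uses the .data pronoun, so each result keeps the real column name instead of `df[, colname]`. The name counts is only illustrative.

```r
library(dplyr)

df <- iris

# Count the values of every column, most frequent first.
# .data[[col]] looks the column up by its string name inside the pipe.
counts <- lapply(setNames(nm = colnames(df)), function(col) {
  df %>%
    count(.data[[col]], name = "n", sort = TRUE)
})

# One tibble per column, accessible by column name
counts$Species
```

Because the result is a named list, a single column's counts can be pulled out with counts[["Sepal.Length"]] and so on.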

Related

Adding unique ID column associated to two groups R [duplicate]

I have a data frame in this format:
Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8
I want to create a unique ID column which considers both group and each unique observation within it, so that it is formatted like so:
Group Observation Unique_ID
a 1 1.1
a 2 1.2
a 3 1.3
b 4 2.1
b 5 2.2
c 6 3.1
c 7 3.2
c 8 3.3
Does anyone know of any syntax or functions to accomplish this? The formatting does not need to exactly match '1.1' as long as it signifies group and each unique observation within it. Thanks in advance
Another way using cur_group_id and row_number
library(dplyr)
A <- 'Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8'
df <- read.table(textConnection(A), header = TRUE)
df |>
group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(), ".", row_number())) |>
ungroup()
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
library(tidyverse)
df <- read_table("Group Observation
a 1
a 2
a 3
b 4
b 5
c 6
c 7
c 8")
df %>%
mutate(unique = Group %>%
as.factor() %>%
as.integer() %>%
paste(., Observation, sep = "."))
#> # A tibble: 8 x 3
#> Group Observation unique
#> <chr> <dbl> <chr>
#> 1 a 1 1.1
#> 2 a 2 1.2
#> 3 a 3 1.3
#> 4 b 4 2.4
#> 5 b 5 2.5
#> 6 c 6 3.6
#> 7 c 7 3.7
#> 8 c 8 3.8
Created on 2022-07-12 by the reprex package (v2.0.1)
Try this
df |> group_by(Group) |>
mutate(Unique_ID = paste0(cur_group_id(),".",1:n()))
output
# A tibble: 8 × 3
# Groups: Group [3]
Group Observation Unique_ID
<chr> <int> <chr>
1 a 1 1.1
2 a 2 1.2
3 a 3 1.3
4 b 4 2.1
5 b 5 2.2
6 c 6 3.1
7 c 7 3.2
8 c 8 3.3
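The same numbering can also be sketched in base R, without dplyr. This is only an illustration: the helper names grp and idx are made up here, and groups are numbered in order of first appearance.

```r
df <- data.frame(
  Group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  Observation = 1:8
)

# Group number: position of each group among the unique group labels
grp <- match(df$Group, unique(df$Group))

# Running counter that restarts at 1 within each group
idx <- ave(seq_along(grp), grp, FUN = seq_along)

df$Unique_ID <- paste0(grp, ".", idx)
df
```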

save mutate output under dplyr

I'm computing the frequency by group under dplyr. But the output is not automatically saved as a dataframe and only shows the first 10 rows. Does anyone know how to do that? I need to use all rows of data for further analyses. THANKS!
library(dplyr)
data01 %>%
group_by(Country, relsta) %>%
summarize(Freq=n()) %>%
mutate (married = Freq/sum(Freq))
Output
Country relsta Freq married
<int> <chr> <int> <dbl>
1 1 1 15 0.176
2 1 3 1 0.0118
3 1 4 28 0.329
4 1 5 6 0.0706
5 1 6 22 0.259
6 1 7 1 0.0118
7 1 99 12 0.141
8 2 NA 273 1
9 3 NA 129 1
10 4 2 9 0.0796
# ... with 115 more rows
dplyr returns tibbles; the remaining rows exist but are hidden when the result is printed. Here is an example using iris:
library(dplyr)
res1 <- iris %>%
group_by(Sepal.Length, Species) %>%
summarize(Freq=n()) %>%
mutate(foo = Freq/sum(Freq))
res1
# Sepal.Length Species Freq foo
# <dbl> <fct> <int> <dbl>
# 1 4.3 setosa 1 1
# 2 4.4 setosa 3 1
# 3 4.5 setosa 1 1
# 4 4.6 setosa 4 1
# 5 4.7 setosa 2 1
# 6 4.8 setosa 5 1
# 7 4.9 setosa 4 0.667
# 8 4.9 versicolor 1 0.167
# 9 4.9 virginica 1 0.167
# 10 5 setosa 8 0.8
# # … with 47 more rows
Notice the … with 47 more rows. You may also check the dimensions:
dim(res1)
# [1] 57 4
Also,
class(res1)
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
whereas:
class(iris)
# [1] "data.frame"
To see more rows, convert the result with as.data.frame(). If the data is very large, rows still get omitted; you can raise that limit with e.g. options(max.print=3000) (the default is 99999 entries in base R, while RStudio lowers it to 1000).
as.data.frame(res1)
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 55 7.6 virginica 1 1.0000000
# 56 7.7 virginica 4 1.0000000
# 57 7.9 virginica 1 1.0000000
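As a small sketch of the two points above (saving the result and seeing every row), assuming dplyr >= 1.0.0 for the .groups argument; the names res and res_df are only illustrative:

```r
library(dplyr)

# Assigning the pipeline to a name keeps all rows, not just the 10 printed
res <- iris %>%
  group_by(Sepal.Length, Species) %>%
  summarise(Freq = n(), .groups = "drop_last") %>%  # still grouped by Sepal.Length
  mutate(foo = Freq / sum(Freq)) %>%
  ungroup()

# Print every row of the tibble instead of the first 10
print(res, n = Inf)

# Convert to a plain data.frame if later code expects one
res_df <- as.data.frame(res)
```

The stored object can be used for further analysis directly; the truncation only affects how a tibble is displayed.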
You could also consider using base R. Since the following line already gives you the "Freq" column,
as.data.frame.table(with(iris, table(Sepal.Length, Species)))
you could do this:
res2 <- with(iris, table(Sepal.Length, Species)) |>
as.data.frame.table() |>
transform(foo=ave(Freq, Sepal.Length, FUN=\(x) x/sum(x))) |>
subset(Freq > 0)
res2
# Sepal.Length Species Freq foo
# 1 4.3 setosa 1 1.0000000
# 2 4.4 setosa 3 1.0000000
# 3 4.5 setosa 1 1.0000000
# [...]
# 103 7.6 virginica 1 1.0000000
# 104 7.7 virginica 4 1.0000000
# 105 7.9 virginica 1 1.0000000
Where:
dim(res2)
# [1] 57 4
class(res2)
# [1] "data.frame"
Note: R >= 4.1 used
The summarize function always returns just one row per group. mutate will keep all the rows here. Try:
library(dplyr)
data02 = data01 %>%
group_by(Country, relsta) %>%
mutate(Freq=n()) %>%
mutate (married = Freq/sum(Freq))

calculate grand mean from means in r

I am trying to aggregate a grand mean from mean scores for students. Here is what my dataset looks like:
id <- c(1,1,1, 2,2,2, 3,3, 4,4,4)
mean <- c(5,5,5, 6,6,6, 7,7, 8,8,8)
data <- data.frame(id,mean)
> data
id mean
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
7 3 7
8 3 7
9 4 8
10 4 8
11 4 8
I am using dplyr package for this calculation. I use this,
data %>%
mutate(grand.mean = mean(mean))
id mean grand.mean
1 1 5 6.454545
2 1 5 6.454545
3 1 5 6.454545
4 2 6 6.454545
5 2 6 6.454545
6 2 6 6.454545
7 3 7 6.454545
8 3 7 6.454545
9 4 8 6.454545
10 4 8 6.454545
11 4 8 6.454545
However, this does not consider repeated means for each id. The calculation should be grabbing unique means from each id and average them over.
so it is (5+6+7+8)/4 = 6.5 instead of 6.45.
Any ideas?
Thanks!
If there are duplicates for mean in different 'id', use match to get the position of the first 'id' and get the mean of the 'mean' column
library(dplyr)
data %>%
mutate(grand.mean = mean(mean[match(unique(id), id)]))
# id mean grand.mean
#1 1 5 6.5
#2 1 5 6.5
#3 1 5 6.5
#4 2 6 6.5
#5 2 6 6.5
#6 2 6 6.5
#7 3 7 6.5
#8 3 7 6.5
#9 4 8 6.5
#10 4 8 6.5
#11 4 8 6.5
Or another option is duplicated
data %>%
mutate(grand.mean = mean(mean[!duplicated(id)]))
Or take the distinct rows of 'id' and 'mean', get the mean, and bind the columns with the original dataset
library(tidyr)
data %>%
distinct(id, mean) %>%
summarise(grand.mean = mean(mean)) %>%
uncount(nrow(data)) %>%
bind_cols(data, .)
A base R one-liner could be:
mean(tapply(data$mean, data$id, '[', 1))
#[1] 6.5
To put the result in the original data set do
data$grand.mean <- mean(tapply(data$mean, data$id, '[', 1))
You can use unique and then calculate the mean to get a grand mean.
mean(unique(data)[,"mean"])
#[1] 6.5
Or you can aggregate by id and then calculate the mean to get a grand mean.
mean(aggregate(mean~id, data, base::mean)[,"mean"])
#[1] 6.5
Or use ave to get the number of repeated values per id and use its inverse as a weight in weighted.mean.
weighted.mean(mean, 1/ave(id, id, FUN=length))
#[1] 6.5
If you only need a single answer for the grand mean, just use two 'summarise' steps with 'dplyr':
library(dplyr)
data %>%
group_by(id) %>%
summarise(mean = mean(mean)) %>%
summarise(grand.mean = mean(mean))
Result:
grand.mean
<dbl>
1 6.5
Using dplyr, we can group_by id and get the mean of unique mean values in each id, then get the grand_mean of the entire dataset and do a right_join with the original data to add grand_mean as a new column.
library(dplyr)
data %>%
group_by(id) %>%
summarise(grand_mean = mean(unique(mean))) %>%
mutate(grand_mean = mean(grand_mean)) %>%
right_join(data, by = 'id')
# A tibble: 11 x 3
# id grand_mean mean
# <dbl> <dbl> <dbl>
# 1 1 6.5 5
# 2 1 6.5 5
# 3 1 6.5 5
# 4 2 6.5 6
# 5 2 6.5 6
# 6 2 6.5 6
# 7 3 6.5 7
# 8 3 6.5 7
# 9 4 6.5 8
#10 4 6.5 8
#11 4 6.5 8

How to group by multiple values in a function with dplyr

I wonder how to modify below code
xxx<-function(df,groupbys){
groupbys<-enquo(groupbys)
df%>%group_by_(groupbys)%>%summarise(count=n())
}
zzz<-xxx(iris,Species)
to have the option to feed in either one column or more than one column to group by? For example, group by both Species and Petal.Length with the iris dataset.
When using enquo (single argument) or enquos (multiple), you should use the !! and !!! operators, respectively.
xxx <- function(df, ...) {
grps <- enquos(...)
df %>%
group_by(!!!grps) %>%
tally() %>%
ungroup()
}
mtcars %>% xxx(cyl, am)
# # A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
# 1 4 0 3
# 2 4 1 8
# 3 6 0 4
# 4 6 1 3
# 5 8 0 12
# 6 8 1 2
or if you want to keep a single argument in the function formals for one or more column names, I think you'll need to use vars() in the call. (Perhaps there's another way suggested in the Programming with dplyr vignette.)
xxx <- function(df, groups) {
df %>%
group_by(!!!groups) %>%
tally() %>%
ungroup()
}
xxx(mtcars, vars(cyl, am))
Here you just need to use the .dots argument of the group_by function. Just ensure that groupbys is a character vector, i.e.
xxx<-function(df,groupbys){
df%>%group_by(.dots = groupbys)%>%summarise(count=n())
}
xxx(iris,"Species")
# A tibble: 3 x 2
Species count
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
xxx(iris,c("Species","Petal.Length"))
# A tibble: 48 x 3
# Groups: Species [3]
Species Petal.Length count
<fct> <dbl> <int>
1 setosa 1 1
2 setosa 1.1 1
3 setosa 1.2 2
4 setosa 1.3 7
5 setosa 1.4 13
6 setosa 1.5 13
7 setosa 1.6 7
8 setosa 1.7 4
9 setosa 1.9 2
10 versicolor 3 1
Here are two approaches to the problem. If you want to pass column name as unquoted variables, you can use ... and use it in count instead of group_by + summarise.
xxx<-function(df,...){
df %>% count(...)
}
xxx(mtcars, cyl)
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
xxx(mtcars, cyl, am)
# A tibble: 6 x 3
# cyl am n
# <dbl> <dbl> <int>
#1 4 0 3
#2 4 1 8
#3 6 0 4
#4 6 1 3
#5 8 0 12
#6 8 1 2
As a second approach, if you want to pass column names as strings, you can use group_by_at, which accepts string inputs.
xxx<-function(df,groupbys){
df %>% group_by_at(groupbys) %>% summarise(n = n())
}
xxx(mtcars, c("cyl", "am"))
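For the string-based approach, newer dplyr versions (1.0.0 and later, as far as I know) offer across() with all_of() as a replacement for the scoped group_by_at. The following is a hedged sketch, not the answer's original code; count_by is a hypothetical name.

```r
library(dplyr)

# groupbys: character vector with one or more column names
count_by <- function(df, groupbys) {
  df %>%
    group_by(across(all_of(groupbys))) %>%
    summarise(count = n(), .groups = "drop")
}

count_by(iris, "Species")
count_by(mtcars, c("cyl", "am"))
```

all_of() raises an error if a requested column does not exist, which makes typos in groupbys easier to spot.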

How to avoid ellipsis ... in dplyr?

I want to create a function that takes a grouping argument. Which can be a single or multiple variables. I want it to look like this:
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
This works only when a single group is given but breaks when there are multiple groups. I know it's possible to use the following with ellipsis ... (but I want the syntax groups = something):
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
Here is the entire code:
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
# works
wanted(iris, groups = Species )
not_wanted(iris, Species, group2)
# doesn't work
wanted(iris, groups = vars(Species, group2) )
wanted(iris, groups = c(Species, group2) )
wanted(iris, groups = vars("Species", "group2") )
# Error: Column `vars(Species, group2)` must be length 150 (the number of rows) or one, not 2
You guys are over complicating things, this works just fine:
library(tidyverse)
wanted <- function(data, groups){
data %>% count(!!!groups)
}
mtcars %>% wanted(groups = vars(mpg,disp,hp))
# A tibble: 31 x 4
mpg disp hp n
<dbl> <dbl> <dbl> <int>
1 10.4 460 215 1
2 10.4 472 205 1
3 13.3 350 245 1
4 14.3 360 245 1
5 14.7 440 230 1
6 15 301 335 1
7 15.2 276. 180 1
8 15.2 304 150 1
9 15.5 318 150 1
10 15.8 351 264 1
# … with 21 more rows
The triple bang operator and parse_quos from the rlang package will do the trick. For more info, see e.g. https://stackoverflow.com/a/49941635/6086135
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
vec <- c("Species", "group2")
wanted <- function(data, groups){
data %>% count(!!!rlang::parse_quos(groups, rlang::current_env()))
}
wanted(iris, vec)
#> # A tibble: 15 x 3
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
Created on 2020-01-06 by the reprex package (v0.3.0)
Here is another option to avoid quotation marks in the function call. I admit it's not very pretty, though.
library(tidyverse)
wanted <- function(data, groups){
grouping <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
data %>% group_by_at(grouping) %>% count()
}
iris$group2 <- rep(1:5, 30)
wanted(iris, groups = c(Species, group2) )
#> # A tibble: 15 x 3
#> # Groups: Species, group2 [15]
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
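Another possibility, sketched here under the assumption of dplyr >= 1.0.0, is to wrap the embraced argument in across(), which accepts tidyselect syntax such as c(Species, group2). The names count_groups and iris2 are only illustrative.

```r
library(dplyr)

# groups: a single bare column name or c(col1, col2, ...)
count_groups <- function(data, groups) {
  data %>%
    group_by(across({{ groups }})) %>%
    count()
}

# Work on a copy so the built-in iris data stays untouched
iris2 <- iris
iris2$group2 <- rep(1:5, 30)

count_groups(iris2, groups = Species)
count_groups(iris2, groups = c(Species, group2))
```

This keeps the groups = something call syntax from the question and needs neither vars() nor quoted column names.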

Resources