R - Capture the group_by dataframe - r

I am using the iris data frame:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3.0 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5.0 3.6 1.4 0.2
6 setosa 5.4 3.9 1.7 0.4
7 virginica 6.7 3.3 5.7 2.5
8 virginica 6.7 3.0 5.2 2.3
9 virginica 6.3 2.5 5.0 1.9
10 virginica 6.5 3.0 5.2 2.0
11 virginica 6.2 3.4 5.4 2.3
12 virginica 5.9 3.0 5.1 1.8
I would like to group the dataframe by Species and summarise the data using a custom generic function
My proposed code is the following:
iris %>% group_by(Species) %>% summarise(MySummary= GenericSummaryFunction(.))
GenericSummaryFunction <- function (x){.....}
The problem I am facing is that the data frame being passed to GenericSummaryFunction is ungrouped, so the output is the same for every group instead of being group specific:
Species MySummary
<fct> <dbl>
1 setosa 5.80
2 virginica 5.80
I am not sure what to replace "." with in iris %>% group_by(Species) %>% summarise(MySummary= GenericSummaryFunction(.)) to pass the grouped data frame instead of the whole data frame. I am using dplyr 0.8.3.

Try this and tell me if you have problems.
It should work with older versions of dplyr.
library(dplyr)
library(tidyr)
library(purrr)
GenericSummaryFunction <- function(x){
# I just came up with something meaningless to get one number
sum(x)
}
iris %>%
nest(-Species) %>%
mutate(data = map(data, GenericSummaryFunction)) %>%
unnest()
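On more recent dplyr versions, a possible alternative that keeps the group_by() pipeline is group_modify(), which (as far as I know, since dplyr 0.8.1) calls a function once per group and passes that group's rows as a data frame. This is only a sketch: GenericSummaryFunction is the same placeholder as above, and MySummary is the column name taken from the question.

```r
library(dplyr)

# Placeholder summary, as in the answer above: one number per data frame
GenericSummaryFunction <- function(x) {
  sum(x)
}

result <- iris %>%
  group_by(Species) %>%
  # .x holds the current group's rows, without the grouping column
  group_modify(~ tibble(MySummary = GenericSummaryFunction(.x))) %>%
  ungroup()

result
```

Because the function receives only the rows of one group, the summary differs per Species instead of repeating the whole-data value.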

Related

Dynamically "gluing" with {glue}

Is there any way to supply a vector to {glue} to dynamically choose which columns get "glued"? The desired column below is what I am hoping to see, but I want to be able to provide vars to the glue statement.
library(glue)
library(dplyr)
vars <- c("Sepal.Length", "Species")
iris %>%
head() %>% ## just for less data
# mutate(glue_string = glue_data("{vars}")) %>%
mutate(desired = glue("{Sepal.Length}{Species}"))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species desired
#> 1 5.1 3.5 1.4 0.2 setosa 5.1setosa
#> 2 4.9 3.0 1.4 0.2 setosa 4.9setosa
#> 3 4.7 3.2 1.3 0.2 setosa 4.7setosa
#> 4 4.6 3.1 1.5 0.2 setosa 4.6setosa
#> 5 5.0 3.6 1.4 0.2 setosa 5setosa
#> 6 5.4 3.9 1.7 0.4 setosa 5.4setosa
We may either use .data to extract the column named by each element of 'vars' and glue them together
library(dplyr)
library(glue)
iris %>%
head() %>% ## just for less data
mutate(desired = glue("{.data[[vars[1]]]}{.data[[vars[2]]]}"))
-output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species desired
1 5.1 3.5 1.4 0.2 setosa 5.1setosa
2 4.9 3.0 1.4 0.2 setosa 4.9setosa
3 4.7 3.2 1.3 0.2 setosa 4.7setosa
4 4.6 3.1 1.5 0.2 setosa 4.6setosa
5 5.0 3.6 1.4 0.2 setosa 5setosa
6 5.4 3.9 1.7 0.4 setosa 5.4setosa
Or use across with all_of to subset the data to the columns in 'vars', then invoke str_c to paste those columns together row by row
library(stringr)
library(purrr)
iris %>%
head() %>% ## just for less data
mutate(desired = invoke(str_c, across(all_of(vars))))
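Another option, sketched here under the assumption that the columns should simply be concatenated in the order given in 'vars', is to build the glue template string from the vector first. The name template is only illustrative and is not part of glue.

```r
library(dplyr)
library(glue)

vars <- c("Sepal.Length", "Species")

# Build the string "{Sepal.Length}{Species}" from the vector of column names
template <- paste0("{", vars, "}", collapse = "")

result <- iris %>%
  head() %>%
  # glue() looks up the braced names among the columns of the data
  mutate(desired = glue(template))

result
```

This keeps the glue call itself unchanged, so adding or removing a column only requires editing 'vars'.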

Having trouble using dplyr in R to group by then mutate and generate statistic by group

I have read several other posts where people seem to have had the same problem. I am using the example of the iris dataset. I want to find the max by group but instead it is giving me the max across the whole dataset.
I tried to detach(plyr) because it said to make sure you load dplyr after plyr. I also tried adding dplyr:: before the commands. But neither of those seems to make a difference. I am using dplyr version 1.0.2.
This is the code I am using. I am new to posting so not sure how to show the mistake in the data or how to make data show up correctly. This is what I get but max_sepal should be 5.8 for that first group. Thank you for your help!
iris_1 <- iris %>%
dplyr::group_by(Species) %>%
dplyr::mutate(max_sepal = max(iris$Sepal.Length, na.rm=TRUE))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max_sepal
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 7.9
2 4.9 3 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.9
5 5 3.6 1.4 0.2 setosa 7.9
6 5.4 3.9 1.7 0.4 setosa 7.9
7 4.6 3.4 1.4 0.3 setosa 7.9
8 5 3.4 1.5 0.2 setosa 7.9
9 4.4 2.9 1.4 0.2 setosa 7.9
10 4.9 3.1 1.5 0.1 setosa 7.9
What you are likely looking for is the summarise function which is also part of dplyr. A quick distinction between mutate and summarise is below.
mutate() either changes an existing column or adds a new one.
summarise() calculates a single value (per group).
iris %>%
group_by(Species) %>%
summarise(max_sepal = max(Sepal.Length, na.rm = TRUE))
# A tibble: 3 x 2
Species max_sepal
<fct> <dbl>
1 setosa 5.8
2 versicolor 7
3 virginica 7.9
You can find some more examples of this below:
https://community.rstudio.com/t/what-is-difference-between-mutate-and-summarise/23103/3
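If the goal is to keep every row and only add the per-group maximum, mutate() also works. The likely cause of the problem in the question is that iris$Sepal.Length refers to the full column of the original data frame and so bypasses the grouping, whereas a bare column name is evaluated within each group. A minimal sketch:

```r
library(dplyr)

iris_1 <- iris %>%
  group_by(Species) %>%
  # Bare column name: evaluated separately within each Species group
  mutate(max_sepal = max(Sepal.Length, na.rm = TRUE)) %>%
  ungroup()

# One row per group, to check the per-group maxima
distinct(iris_1, Species, max_sepal)
```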

Write files for several column filter

Given a data.frame and a set of columns, I'd like to write a csv file (or text file in general)
for each column, each file containing information for all columns but with rows filtered based on the respective column.
For example, say I'd like to save a file for each of Sepal.Width and Sepal.Length, containing the top 5 rows for each respectively:
top_n(iris, 5, Sepal.Width)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.4 3.9 1.7 0.4 setosa
# 2 5.8 4.0 1.2 0.2 setosa
# 3 5.7 4.4 1.5 0.4 setosa
# 4 5.4 3.9 1.3 0.4 setosa
# 5 5.2 4.1 1.5 0.1 setosa
# 6 5.5 4.2 1.4 0.2 setosa
# this should go in top5_Sepal.Width.csv
top_n(iris, 5, Sepal.Length)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.7 3.8 6.7 2.2 virginica
# 2 7.7 2.6 6.9 2.3 virginica
# 3 7.7 2.8 6.7 2.0 virginica
# 4 7.9 3.8 6.4 2.0 virginica
# 5 7.7 3.0 6.1 2.3 virginica
# this should go in top5_Sepal.Length.csv
I've tried something like below, however I don't know how to write the mywrite function, i.e. how to access the whole data.frame for filtering (.x only contains the column)
myvars <- c("Sepal.Width", "Sepal.Length")
tmp <- iris %>%
map_at(myvars, ~mywrite(.x))
Alternatively, purrr::map2 allows tracking the names, but doesn't come in an _at flavour (I guess filtering can be done in mywrite then).
However, again no access to the whole iris data.frame:
tmp <- iris %>%
map2(., colnames(iris), ~mywrite(.x, .y))
As a third option, I could loop over the column names, possibly using tidy evaluation, i.e. mycol <- sym(myvars[i]) and !!mycol, but I would ideally like to avoid for loops.
Note, this is a somewhat silly toy example which only serves to illustrate the issue.
Note2, this answer is similar but based on groups within a column rather than individual columns:
We can use map with non-standard evaluation to get top 5 values for each myvars
library(dplyr)
purrr::map(myvars, ~top_n(iris, 5, !!sym(.x)))
#[[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.4 3.9 1.7 0.4 setosa
#2 5.8 4.0 1.2 0.2 setosa
#3 5.7 4.4 1.5 0.4 setosa
#4 5.4 3.9 1.3 0.4 setosa
#5 5.2 4.1 1.5 0.1 setosa
#6 5.5 4.2 1.4 0.2 setosa
#[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 7.7 3.8 6.7 2.2 virginica
#2 7.7 2.6 6.9 2.3 virginica
#3 7.7 2.8 6.7 2.0 virginica
#4 7.9 3.8 6.4 2.0 virginica
#5 7.7 3.0 6.1 2.3 virginica
If you want to write each part to csv, you can extend the pipe to
purrr::map(myvars, ~top_n(iris, 5, !!sym(.x)) %>% write.csv(paste0("top5_", .x, ".csv")))
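As a variation on the answer above, top_n() has been superseded by slice_max() since dplyr 1.0.0, and purrr::walk() suits calls made only for their side effects. The sketch below assumes the files may go to a temporary directory; out_dir is a hypothetical name introduced for illustration.

```r
library(dplyr)
library(purrr)

myvars <- c("Sepal.Width", "Sepal.Length")
out_dir <- tempdir()  # hypothetical output location, change as needed

walk(myvars, function(v) {
  # Rows with the 5 largest values of column v (ties are kept by default)
  top <- slice_max(iris, order_by = .data[[v]], n = 5)
  write.csv(top, file.path(out_dir, paste0("top5_", v, ".csv")),
            row.names = FALSE)
})
```

Like top_n(), slice_max() keeps ties, so the Sepal.Width file still contains six rows.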

How to calculate the difference between consecutive sets of two columns in a column range using dplyr

I would like to calculate the difference between consecutive columns in a range of columns using dplyr.
For example, using the iris data set I would want to be able to specify the range Sepal.Width:Petal.Width and have a dataframe that contained the original iris data and the differences between the consecutive columns from Sepal.Width:Petal.Width:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species diff1 diff2
1 5.1 3.5 1.4 0.2 setosa 2.1 1.2
2 4.9 3.0 1.4 0.2 setosa 1.6 1.2
3 4.7 3.2 1.3 0.2 setosa 1.9 1.1
4 4.6 3.1 1.5 0.2 setosa 1.6 1.3
5 5.0 3.6 1.4 0.2 setosa 2.2 1.2
6 5.4 3.9 1.7 0.4 setosa 2.2 1.3
Someone posted a solution using loops and lapply (Calculate the difference between consecutive, grouped columns in a data.table) but I'm specifically looking for a dplyr solution.
Here's a less advanced approach using dplyr and tidyr verbs. First I gather the columns for differencing into long format, then take their differences vs. previous column, strip out NA's for the first columns which have no previous column, rename the column, spread out, and attach onto the original.
library(tidyverse)
iris %>%
bind_cols(iris %>%
rowid_to_column() %>%
gather(col, val, Sepal.Width:Petal.Width) %>%
group_by(rowid) %>%
mutate(val = abs(val - lag(val))) %>%
filter(!is.na(val)) %>%
mutate(col = paste0("diff_", col)) %>%
spread(col, val) %>%
select(contains("diff"))
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species rowid diff_Petal.Length diff_Petal.Width
1 5.1 3.5 1.4 0.2 setosa 1 2.1 1.2
2 4.9 3.0 1.4 0.2 setosa 2 1.6 1.2
3 4.7 3.2 1.3 0.2 setosa 3 1.9 1.1
4 4.6 3.1 1.5 0.2 setosa 4 1.6 1.3
5 5.0 3.6 1.4 0.2 setosa 5 2.2 1.2
6 5.4 3.9 1.7 0.4 setosa 6 2.2 1.3
7 4.6 3.4 1.4 0.3 setosa 7 2.0 1.1
Here is an option with tidyverse. We select the range of columns, build a list of two data.frames (one without the last column, one without the first), then use reduce to subtract these equally sized datasets from each other, and rename the columns
library(dplyr)
library(purrr)
library(stringr)
out <- iris %>%
select(Sepal.Width:Petal.Width) %>%
{list(.[-length(.)], .[-1])} %>%
reduce(`-`) %>%
rename_all(~ str_c("diff", seq_along(.))) %>%
bind_cols(iris, .)
head(out)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species diff1 diff2
#1 5.1 3.5 1.4 0.2 setosa 2.1 1.2
#2 4.9 3.0 1.4 0.2 setosa 1.6 1.2
#3 4.7 3.2 1.3 0.2 setosa 1.9 1.1
#4 4.6 3.1 1.5 0.2 setosa 1.6 1.3
#5 5.0 3.6 1.4 0.2 setosa 2.2 1.2
#6 5.4 3.9 1.7 0.4 setosa 2.2 1.3
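The select-and-subtract idea above could also be wrapped in a small helper so that the column range is passed as an argument. This is only a sketch: add_consecutive_diffs is a hypothetical name, and it assumes that the selected columns are numeric and that at least two are selected.

```r
library(dplyr)

# Hypothetical helper: appends diff1, diff2, ... for adjacent selected columns
add_consecutive_diffs <- function(df, cols) {
  sel <- select(df, {{ cols }})
  stopifnot(ncol(sel) >= 2)
  # Each column minus its right-hand neighbour
  diffs <- sel[-ncol(sel)] - sel[-1]
  names(diffs) <- paste0("diff", seq_along(diffs))
  bind_cols(df, diffs)
}

out <- add_consecutive_diffs(iris, Sepal.Width:Petal.Width)
head(out)
```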
Another approach is to loop through the column indices, select each pair of adjacent columns, reduce the pair to a single column with - and bind the result to the original dataset
map_dfc(3:4, ~ iris %>%
select(.x-1, .x) %>%
transmute(diff = reduce(., `-`))) %>%
bind_cols(iris, .)
You could also use grepl and which to get the column indices.
start <- which(grepl("Sepal.Width", colnames(iris)))
end <- which(grepl("Petal.Width", colnames(iris)))
for (i in start:(end-1)) {
# assign the new column directly instead of building code with eval(parse())
iris[[paste0("diff", i-1)]] <- iris[[i]] - iris[[i+1]]
}

dplyr: how to reference columns by column index rather than column name using mutate?

Using dplyr, you can do something like this:
iris %>% head %>% mutate(sum=Sepal.Length + Sepal.Width)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
But above, I referenced the columns by their column names. How can I use 1 and 2, which are the column indices, to achieve the same result?
Here I have the following, but I feel it's not as elegant.
iris %>% head %>% mutate(sum=apply(select(.,1,2),1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
You can try:
iris %>% head %>% mutate(sum = .[[1]] + .[[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
I'm a bit late to the game, but my personal strategy in cases like this is to write my own tidyverse-compliant function that will do exactly what I want. By tidyverse-compliant, I mean that the first argument of the function is a data frame and that the output is a vector that can be added to the data frame.
sum_cols <- function(x, col1, col2){
x[[col1]] + x[[col2]]
}
iris %>%
head %>%
mutate(sum = sum_cols(x = ., col1 = 1, col2 = 2))
An alternative to reusing . in mutate that will respect grouping is to use dplyr::cur_data_all(). From help(cur_data_all):
cur_data_all() gives the current data for the current group (including grouping variables)
Consider the following:
iris %>% group_by(Species) %>% mutate(sum = .[[1]] + .[[2]]) %>% head
#Error: Problem with `mutate()` column `sum`.
#ℹ `sum = .[[1]] + .[[2]]`.
#ℹ `sum` must be size 50 or 1, not 150.
#ℹ The error occurred in group 1: Species = setosa.
If instead you use cur_data_all(), it works without issue:
iris %>% mutate(sum = select(cur_data_all(),1) + select(cur_data_all(),2)) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
The same approach works with the extract operator ([[).
iris %>% mutate(sum = cur_data()[[1]] + cur_data()[[2]]) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
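Note that the two working examples above are shown without group_by(). As far as I know, cur_data() and cur_data_all() were deprecated in dplyr 1.1.0 in favour of pick(), so a grouped version on a recent dplyr might look like the sketch below (assuming dplyr >= 1.1.0). Since Species is the last column, positions 1 and 2 refer to Sepal.Length and Sepal.Width here.

```r
library(dplyr)

result <- iris %>%
  group_by(Species) %>%
  # pick() returns the selected columns for the current group only
  mutate(sum = pick(1)[[1]] + pick(2)[[1]]) %>%
  ungroup()

head(result)
```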
What do you think about this version?
Inspired by @SavedByJesus's answer.
applySum <- function(df, ...) {
assertthat::assert_that(...length() > 0, msg = "one or more column indexes are required")
mutate(df, Sum = apply(as.data.frame(df[, c(...)]), 1, sum))
}
iris %>%
head(2) %>%
applySum(1, 2)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
#
### you can select and sum more than two columns with the same function
#
iris %>%
head(2) %>%
applySum(1, 2, 3, 4)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 10.2
2 4.9 3.0 1.4 0.2 setosa 9.5
To address the issue that @pluke is asking about in the comments, dplyr doesn't really support column indices.
Not a perfect solution, but you can use base R to get around this
iris[1] <- iris[1] + iris[2]
This can now (packageVersion("dplyr") >= 1.0.0) be done very nicely with the combination of dplyr::rowwise() and dplyr::c_across().
library(dplyr)
packageVersion("dplyr")
#> [1] '1.0.10'
iris %>%
head %>%
rowwise() %>%
mutate(sum = sum(c_across(c(1, 2))))
#> # A tibble: 6 × 6
#> # Rowwise:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 8.6
#> 2 4.9 3 1.4 0.2 setosa 7.9
#> 3 4.7 3.2 1.3 0.2 setosa 7.9
#> 4 4.6 3.1 1.5 0.2 setosa 7.7
#> 5 5 3.6 1.4 0.2 setosa 8.6
#> 6 5.4 3.9 1.7 0.4 setosa 9.3
Created on 2022-11-01 with reprex v2.0.2
