Subset larger Dataframe into smaller ones in R?

Subset larger Dataframe into smaller ones in R? - r

I have a larger dataframe from which I would like to split up based on 2 columns and a changing 3rd column.
Am on mobile so hard to give a reproducible example so I will try my best to describe.
I have a large Dataframe with 10 columns, the first 2 being ID and Year.
I would like to have smaller ones where the 3rd column will be each of the remaining 8.
So a total of 8 smaller dataframes
I have tried:
newDF1<-select(BIGdf, c("ID", "Year", "3rdVariable"))
newDF2<-select(BIGdf, c("ID", "Year", "4thVariable"))
And achieve the result but is there a way I don't have to write out each individual variable.
Sorry for the poor formatting any help would be appreciated.

It is usually bad practice to split up data which belongs together.
However, you can automatically create new R objects based on expressions using assign:
library(tidyverse)
columns <-
iris %>%
colnames() %>%
setdiff("Species")
columns
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
columns %>%
walk(~ {
data <- iris %>% head(2) %>% select_at(c("Species", .x))
assign(.x, data, envir = globalenv())
})
# access created objects
Sepal.Width
#> Species Sepal.Width
#> 1 setosa 3.5
#> 2 setosa 3.0
Sepal.Length
#> Species Sepal.Length
#> 1 setosa 5.1
#> 2 setosa 4.9
Created on 2021-11-25 by the reprex package (v2.0.1)

Adding to danlooos answer, if you want a Base R solution you could just use a loop:
for (col in colnames(iris)[-1:-2]) {
assign(col, iris[, c("Sepal.Length", "Sepal.Width", col)], envir = globalenv())
}
Or do the same thing and store the resulting data frames in a list, which I personally find somewhat cleaner:
new_frames <- list()
for (col in colnames(iris)[-1:-2]) {
new_frames <- append(new_frames, list(iris[, c("Sepal.Length", "Sepal.Width", col)]))
}

Related

Iterate sequentially over two lists in R

I have two df that look something like this
library(tidyverse)
iris <- iris%>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
My aim is to reduce the values of the variables in iris that are above the maximum values of the corresponding variable in iris2, to match the maximum value in iris2.
I have written a function that does this.
max(iris$Sepal.Length)
[1] 9.9
max(iris2$Sepal_Length)
[1] 7.9
# i want every value of iris that is >= to max value of iris2 to be equal to the max value of iris 2.
# my function:
fixmax<- function(data,data2,var1,var2) {
data<- data %>%
mutate("{var1}" := ifelse(get(var1)>=max(data2[[var2]],na.rm = T),
max(data2[[var2]],na.rm = T),get(var1)))
return(data)
}
# apply my function to a variable
tst_iris <- fixmax(iris,iris2,"Sepal.Length","Sepal_Length")
max(tst_iris$Sepal.Length)
7.9 # it works!
The challange I face is that I would like to iterate my function sequentially overtwo lists of variables- i.e. Sepal.Length with Sepal_Length, Sepal.Widthwith Sepal_Width etc.
Does anyone knows how I can do this?
I tried using Map but I am doing something wrong.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
My goal is to obtain a df (final_iris) where every variable has been adjusted using the criteria specified by fixmax.
I know I can do this by running my function on every variable like so.
final_iris <- iris
final_iris <- fixmax(final_iris,iris2,"Sepal.Length","Sepal_Length")
final_iris <- fixmax(final_iris,iris2,"Sepal.Width","Sepal_Width")
final_iris <- fixmax(final_iris,iris2,"Petal.Length","Petal_Length")
final_iris <- fixmax(final_iris,iris2,"Petal.Width","Petal_Width")
But in the real data, I have to run this operation tens of times and I would like to be able to loop my function sequentially.
Does anyone know how I loop my fixmax over lst1 and lst2 sequentially?

Rather than explicitly iterating over the different datasets and columns by name, you can take advantage of the vectorization built into R. If the dataframes have the same column/variable ordering a function mapped to both dataframes using mapply or purrr::map2 will iterate column by column without the need to specify column names.
Given two input data frames (df_small and df_big) the steps are:
Calculate the max of each column in df_small to create df_small_max
Apply the pmin function to each column of df_big and each value of df_small_max using mapply (or purr::map2_dfc if you prefer tidyverse mapping)
#set up fake data
df_small <- iris[,1:4]
df_big <- df_small + 2
# find max of each col in df_small
df_small_max <- sapply(df_small, max)
# replace values of df_big which are larger than df_small_max
df_big_fixed <- mapply(pmin, df_big, df_small_max)
# sanity check -- Note the change in Sepal.Width
df_small_max
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 7.9 4.4 6.9 2.5
head(df_big, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 7.1 5.5 3.4 2.2
#> 2 6.9 5.0 3.4 2.2
#> 3 6.7 5.2 3.3 2.2
head(df_big_fixed, 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] 7.1 4.4 3.4 2.2
#> [2,] 6.9 4.4 3.4 2.2
#> [3,] 6.7 4.4 3.3 2.2
Created on 2021-07-31 by the reprex package (v2.0.0)

It's likely that your issue is related to the fact that dataframes are themselves lists. Map() expects the non-function arguments to be lists of the same length. Any arguments that are shorter than the longest list are "recycled" to match it's length.
Currently, you have:
final_iris<- Map(fixmax,iris, iris2,lst1,lst2)
This is actually equivalent to:
final_iris<- Map(fixmax,
list(iris$Sepal.Length,
iris$Sepal.Width,
iris$Petal.Length,
iris$Petal.Width,
iris$Species),
list(iris2$Sepal_Length,
iris2$Sepal_Width,
iris2$Petal_Length,
iris2$Petal_Width,
iris2$Species),
lst1,
lst2)
(To understand why, you must remember that dataframes like iris and iris2 are, technically, under the hood, lists of [atomic] vectors.)
I suspect that you want iris and iris2 to be supplied to each call to fixmax(). In order to have Map() recycle these two vectors, they need to be supplied as single-element lists. Like so:
final_iris<- Map(fixmax, list(iris), list(iris2),lst1,lst2)
To combine a list of dataframes into a single dataframe do
do.call(rbind, final_iris)

Here is a mostly base way. I also renamed the variables because I had some trouble replicating since originally the approach would save over the iris object.
The approach is that instead of mutating a data.frame object, we instead only return the vector of the expected values from our modified function. Then, we re-assign those values back to our original data.frame.
fixmax2 = function(x, y) {
max_y = max(y, na.rm = TRUE)
ifelse(x >= max_y, max_y, y)
}
cols = which(sapply(df_plus, is.numeric))
df_plus[cols] = Map(fixmax2, df_plus[cols], df_iris[cols])
df_plus
Raw data:
library(dplyr)
df_plus = iris %>% mutate_at((1:4), ~. + 2) ## let's not save over iris
df_iris = iris
names(df_iris)<-sub(".", "_", names(df_iris), fixed = TRUE)

Is that what you're expecting ?
my_a <- iris %>% mutate_at((1:4),~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
my_var <- which(my_a$Sepal.Length >= max(iris2$Sepal_Length) & my_a$Sepal.Width >= max(iris2$Sepal_Width))
if (length(my_var)) {
my_a <- my_a[my_var,]
}

Your function seems convoluted and hard to read at a first glance. We can tidy up the function to return max(x, max_val) for each value in a column with a quick function
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
Finally, we want to apply this automatically and sequentially using the two dataframes. We will use a simple for loop. Code to set up the problem is attached.
#libraries
library(tidyverse)
#set up fake data
iris_big <- iris%>% mutate_at((1:4),~.+2)
iris_small <- iris
names(iris_small)<- sub(".", "_", names(iris_small), fixed = TRUE)
#check which is the bigger one and the smaller
max(iris_big$Sepal.Length) #bigger
max(iris_small$Sepal_Length) #smaller
#function to correct max
adjust_max <- function(x, max_val) {
return(ifelse(x >= max_val, max_val, x))
}
#apply it to get a final result
iris_final <- iris_big
# iterate over columns, assuming same positions
# you can edit the 1:ncol(iris_final) to only take the columns you want
for (i in 1:ncol(iris_final)) {
#check numeric
if (is.numeric(iris_final[,i])) {
#applies the function - notice we call iris_final and iris_small
iris_final[,i] <- sapply(iris_final[,i],
adjust_max,
max_val = max(iris_small[,i]))
}
}
#check answer is correct
apply(iris_final[,1:4], 2, max)
apply(iris_small[,1:4], 2, max)
tail(iris_final)

For a tidyverse approach you can use transmute instead of mutate. transmute would return only one column in each iteration whereas mutate would return all the columns every time.
Apart from that to make it more tidyverse friendly I am using .data instead of get. Also using pmin instead of complicated ifelse solution.
library(dplyr)
library(purrr)
fixmax<- function(data,data2,var1,var2) {
data<- data %>% transmute("{var1}" := pmin(.data[[var1]], max(data2[[var2]])))
return(data)
}
To apply the function to each pair of columns you can use map2_dfc which will also combine the results in one dataframe.
lst1 <- names(iris[,1:4])
lst2 <- names(iris2[,1:4])
Compare the max values of two dataframes before applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 9.9 6.4 8.9 4.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5
Apply the function -
iris[lst1] <- map2_dfc(lst1, lst2, ~fixmax(iris, iris2, .x, .y))
Compare the max values of two dataframes after applying the function.
map_dbl(iris[lst1], max)
#Sepal.Length Sepal.Width Petal.Length Petal.Width
# 7.9 4.4 6.9 2.5
map_dbl(iris2[lst2], max)
#Sepal_Length Sepal_Width Petal_Length Petal_Width
# 7.9 4.4 6.9 2.5

You should consider using column indices; a complete (not including the data-frame construction) base R solution could look like:
# Resolve the indices of the numeric vectors in
# iris: num_cols => integer vector
num_cols <- which(
vapply(
iris,
is.numeric,
logical(1)
),
arr.ind = TRUE
)
# Map the pmin function over iris to select the
# minimum of the vector element in iris and the
# maximum values of that vector in iris2:
# iris => data.frame
iris[,num_cols] <- Map(function(i){
pmin(
iris[,i],
max(
iris2[,i],
na.rm = TRUE
)
)
},
num_cols
)

You can do this by creating a matrix of the max value repeated in each column and use pmin to take the minimum values between the max values in iris2 and the values in the other dataframe. I created a new fixmax function which only takes the two dataframes as arguments.
Preparing the data
library(tidyverse)
initial <- iris %>% mutate_at(1:4, ~.+2)
iris2 <- iris
names(iris2)<-sub(".", "_", names(iris2), fixed = TRUE)
print(max(initial$Sepal.Length))
# [1] 9.9
print(max(iris2$Sepal_Length))
# [1] 7.9
Creating the function
fixmax <- function(df, dfmax){
colids <- which(unlist(lapply(dfmax, is.numeric)))
dfmax <- apply(dfmax[, colids], 2, max) %>%
matrix(nrow=nrow(dfmax), ncol=length(colids), byrow=TRUE) %>%
as.data.frame()
df[, colids] <- pmin(df[,colids], dfmax)
return(df)
}
Testing the function
newiris <- fixmax(initial, iris2)
print(max(newiris$Sepal.Length))
# [1] 7.9
assertthat::assert_that(!identical(newiris, iris2))
# [1] TRUE
assertthat::assert_that(all((initial == newiris) || (iris2 == newiris)))
# [1] TRUE
imax = apply(iris2[, 1:4], 2, max) %>%
matrix(nrow=nrow(iris2), ncol=4, byrow=TRUE) %>%
as.data.frame()
assertthat::assert_that(all(newiris[, 1:4] <= imax))
# [1] TRUE
print(head(newiris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.1 4.4 3.4 2.2 setosa
# 2 6.9 4.4 3.4 2.2 setosa
# 3 6.7 4.4 3.3 2.2 setosa
# 4 6.6 4.4 3.5 2.2 setosa
# 5 7.0 4.4 3.4 2.2 setosa
# 6 7.4 4.4 3.7 2.4 setosa

Take function from dataframe

I need to perform calculation based on inputs defined in a dataframe. Refer the dataframe RefDf below. It has 3 columns - column name, calculation, New Variable Name. When Calculation column contains count, we should use n_distinct( ) function.
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length count Petal.LengthNew
", header = T)
Manual Approach - Needs to be automated via inputs in RefDf. Species remains same for grouping.
library(dplyr)
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
I am looking for dplyr or base R based solution

Here's a solution with data.table package
library(data.table)
library(dplyr)
# using data.table
dt <- as.data.table(RefDf)
dt[Calculation == "count", Calculation := "n_distinct"]
# function for doing grouping calculation
inner.fun <- function(calc, data, column, group="Species"){
print(column)
data.dt <- as.data.table(data)
data.dt[, .(as.numeric(get(calc)(get(column)))), by=group][]
}
out <- dt[, inner.fun(calc=Calculation, data=iris, column=Variables), by=NewVariable]
# reshape from wide to long
out2 <- dcast(data=out, Species ~ NewVariable, value.var="V1")
# convert to data.frame
out_df <- as.data.frame(out2)
out_df
Species Petal.LengthNew Sepal.Length2
1 setosa 9 250.3
2 versicolor 19 296.8
3 virginica 20 329.4

Mapping the same function to multiple tibbles in a list

I have several tibbles, each of which has different numbers of columns and different column names. I want to standardize the column names for each to be all lowercase. This works for an individual tibble:
library(magrittr)
library(purrr)
colnames(tbl1) %<>% map(tolower)
The column names for the object tbl1 are now all lowercase.
If I put all my tibbles in a list, however, this doesn't work:
all_tbls <- list(tbl1, tbl2, tbl3)
all_tbls %<>% map(function(tbl) {colnames(tbl) %<>% map(tolower)})
The colnames for the objects tbl1, tbl2, and tbl3 are not changed by this. The objects in the list all_tbls are now lists of the column names for each tbl, i.e. what you'd get if you applied as.list() to the result of colnames()).
Why is this happening? Is there a better approach to doing this? I'd prefer to use tidyverse functions (e.g. map instead of *apply) for consistency with other code, but am open to other solutions. EDIT: To be clear, I'd like to be able to work with the original tibble objects, i.e. the desired outcome is for the colnames of tbl1, tbl2, and tbl3 to change.
Other Q&A I looked at and did not find illuminating includes:
apply function to elements over a list
R: apply a function to a list of dataframes and save to workspace

library(magrittr)
library(purrr)
all_tbls %<>% map(~set_names(.x,tolower(colnames(.x))))
The objects in the list all_tbls are now lists of the column names for each tbl
Because you're asking map to lower the column names and return them as a list
To modify in place we can use data.table::setnames, since data.table using copy in place against copy by reference.
library(data.table)
library(purrr)
map(list(df1,df2),~setnames(.,old = names(.), new = tolower(names(.))))
Data
df1 <- read.table(text='
A B
1 1',header=TRUE)
df2 <- read.table(text='
C D
2 2',header=TRUE)

The function you're mapping is returning the column names, you need it to return the actual tibble instead:
all_tbls %<>% map(function(tbl) {
colnames(tbl) %<>% map(tolower)
tbl
})

You can use purrr::map and dplyr::rename_all :
all_tbls <- list(head(iris,2), head(cars,2))
library(tidyverse)
all_tbls %>% map(rename_all, toupper)
# [[1]]
# SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
#
# [[2]]
# SPEED DIST
# 1 4 2
# 2 4 10
#

take mean of variable defined by string in dplyr

Seems like this should be easy but I'm stumped. I've gotten the rough hang of programming with dplyr 0.7, but struggling with this: How do I program in dplyr if the variable I want to program with will be a string?
I am scraping a database, and for a variety of reasons want to summarize a variable that I will know the position of but not the name of (the thing I want is always the first column of the supplied table, but the name of the variable stored in that column will vary depending on the database being scraped). To use iris as an example, suppose that I know that the variable that I want is in the first column
library(tidyverse)
desired_var <- colnames(iris)[1]
print(desired_var)
"Sepal.Length"
I now want to group by Species, and take the mean of desired_var, i.e. what I want is to perform
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(Sepal.Length))
But, now I want to take the mean of a column which is defined by a string stored in desired_var
I get how to do this with a "bare" Sepal.Length
desired_var <- quo(Sepal.Length)
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!desired_var))
But how in the world do I deal with the fact that I have "Sepal.Length" not Sepal.Length , i.e. that desired_var <- "Sepal.Length" ?

You're wondering into tidyeval which is a rather new feature of the tidyverse (see here) more used to create functions using tidyverse functions. For now it is only available with dplyr but the plan is to extend it to the other tidyverse packages.
For your need though, you don't really need to get into that, when summarize_at will do. This function allows you to extend a particular manipulation that you specify across any variables of your choosing:
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of("Sepal.Length", "Sepal.Width")), funs(desired_mean = mean))
# A tibble: 3 x 3
Species Sepal.Length_desired_mean Sepal.Width_desired_mean
<fctr> <dbl> <dbl>
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
You can store the list of variables into a vector, and then use that vector instead:
selected_vectors <- c("Sepal.Length", "Sepal.Width")
iris %>%
group_by(Species) %>%
summarise_at(vars(one_of(selected_vectors)), funs(desired_mean = mean))

1) dynamic variable with !!sym Use sym (or parse_expr) like this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise(desired_mean = mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species desired_mean
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
2) summarise_at As #Phil points out in the comments in the particular case of summarise this could be done like this without using any rlang facilities:
library(dplyr)
desired_var <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarise_at(desired_var, funs(mean)) %>%
ungroup
giving:
# A tibble: 3 x 2
Species Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
3) dynamic variable and name with !! If you need to set the name dynamically in (1) then try this:
library(dplyr)
library(rlang)
desired_var <- "Sepal.Length"
desired_var_name <- paste("mean", desired_var, sep = "_")
iris %>%
group_by(Species) %>%
summarise(!!desired_var_name := mean(!!sym(desired_var))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species mean_Sepal.Length
<fctr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588

R dplyr summarise multiple functions to selected variables

I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which give me the following result
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width)to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above and gives the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you

Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)))
%>% reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y))
%>% reduce(inner_join, by = "Species")
+ + + # A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and tibble are the desired result, the last column being the max of petal.width and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
summarise_at accepts as arguments two lists, one of n variables and one of m functions, and applies all m functions to all n variables, therefore producing m X n vectors in a tibble. The solution might thus imply forcing this function to loop in some way across "couples" formed by all variables to which we want one specific function to be applied and the one function, then another group of variables and their own function, and so on!
Now, what does the above in R? What does force an operation to corresponding elements of two lists? Functions such as mapply or the family of functions map2, pmap and variations thereof from dplyr's tidyverse fellow purrr. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists.
Because the product is not a tibble or a data.frame, but a list, you
simply need to use reduce with inner_join or just merge.
Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris dataset?).

If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules. For example
Here's a rough start
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.

If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new across function which will be available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
It shows that the only difficulty is to combine two calculations on the same column Petal.Width on a grouped variable - you have to do the grouping again but can nest it into the cbind.
This returns correctly the result:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.6
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task would not specify two calculations but only one on the same column Petal.Width, then this could be elegantly written as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)

I was looking for something similar and tried the following. It works well and much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246 0.6
2 versicolor 5.94 2.77 4.26 1.33 1.8
3 virginica 6.59 2.97 5.55 2.03 2.5
In summarise() part, define your column name and give your column to summarise inside your function of choice.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Subset larger Dataframe into smaller ones in R? - r

Related

Iterate sequentially over two lists in R

Take function from dataframe

Mapping the same function to multiple tibbles in a list

take mean of variable defined by string in dplyr

R dplyr summarise multiple functions to selected variables

Categories

Resources