mutate and/or summarise a dynamic number of columns - r

In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.
Consider the following dataset.
library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
mutate(g2 = ifelse(gear == 2, 1, 0),
g3 = ifelse(gear == 3, 1, 0),
g4 = ifelse(gear == 4, 1, 0))
Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:
answer = mtcars %>%
summarise(sum_g2 = sum(g2),
sum_g3 = sum(g3),
sum_g4 = sum(g4))
But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer of my previous approach I would use:
columns_to_sum = c("g2","g3","g4")
formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")
answer = mtcars %>%
summarise(!!!parse_exprs(formulas))
If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.
Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).
In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired as the rlang approach is the intended approach now. The related questions I reviewed on Stackoverflow did not use the rlang quotation approach and hence are not sufficient for my purposes.
How do I summarise with a dynamic number of columns (using an rlang approach)?

One option since dplyr 1.0.0 could be:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
sum_g2 sum_g3 sum_g4
1 0 15 12

Your attempt gives the correct answer but do not give column names as expected.
Here's an approach using map to get the names correct :
library(dplyr)
library(rlang)
library(purrr)
map_dfc(columns_to_sum, ~mtcars %>%
summarise(!!paste0('sum_', .x) := sum(!!sym(.x))))
# sum_g2 sum_g3 sum_g4
#1 0 15 12
You can also use this simple base R approach without any NSE-stuff :
setNames(data.frame(t(colSums(mtcars[columns_to_sum]))),
paste0('sum_', columns_to_sum))
and same in dplyr way :
mtcars %>%
summarise(across(all_of(columns_to_sum), sum)) %>%
set_names(paste0('sum_', columns_to_sum))

Related

convert tidyverse rowwise operation to data.table solution

I have a dataframe with millions of rows and tens of columns that I need to apply a rowwise operation. My solution below works using dplyr but I hope a switch to data.table will speed things up. Any help converting the below code to a data.table version would be appreciated.
library(tidyverse)
library(trend)
df = structure(list(id = 1:2, var = c(3L, 9L), col1_x = c("[(1,2,3)]",
"[(100,90,80,70,60,50,40,30,20)]"), col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]"
)), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
x_cols = df %>%
select(ends_with("x")) %>%
names()
df = df %>%
rowwise() %>%
mutate(across(all_of(x_cols) ,~ ifelse(var<=4,0,sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]),.)) %>%
ungroup()
While what #Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set or :=. When you see the keyword set (which can just be part of the function name) or the := symbol, you've told data.table not to make copies of the data. Without declaring or declaration (that pesky = or <-), you've changed the data table. This is one of the key methods to prevent wasted memory with this package.
Keep in mind that the environment pane in RStudio is triggered to update when it registers that operator (= or <-), creating something new. Since you did a replace-in-place, the environment pane may reflect incorrect information. You can use the refresh icon (top right of the pane), or you can print the object to the console to check.
As soon as you declare anything that the pane identifies, everything in the pane is updated.
Change a data frame to a data.table. (Notice that keyword—set!) Both of these do the same thing. However, one copies everything in memory and makes it again. (Naming the frame the same thing does not prevent copies.)
setDT(df)
df <- data.table(df)
I'm not going to start with your first code blurb. I'm starting with the name extraction.
You wrote:
x_cols = df %>%
select(ends_with("x")) %>%
names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table. I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols in the remaining examples. (Why keep reiterating the same declaration?)
You wrote the following to remove the brackets and parentheses.
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
# id var col1_x col2_x
# 1 1 3 1,2,3 2,4,6
# 2 2 9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. These do the exact same thing as each other and your code.
Note the :=, which means the table changed.
In the first example, I used .SD and .SDcols. These are data column selection tools. You use .SD in place of the column name when you want to use more than one column. Then use .SDcols to tell data.table what columns you're trying to use. By annotating (xcols), where xcols is my variable representing my column names to use, this tells data.table to replace the data in the columns used for the aggregation.
The difference between these two is how I used lapply, which doesn't have anything to do with data.table. If you need more info on this function, you can ask me, or you can look through the many Q & As out there already.
df[,
(xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
.SDcols = xcols]
df[,
(xcols) := lapply(.SD, gsub, pattern = "[][()]",
replacement = ""),
.SDcols = xcols]
Your last request was based on this code.
df %>%
rowwise() %>%
mutate(across(all_of(x_cols),
~ifelse(var <= 5, 0, sens.slope(
as.numeric(unlist(
strsplit(., ','))))$estimates[[1]]),.)) %>%
ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
(xcols) := lapply(.SD,
function(k) {
ifelse(var <= 3, 0,
sens.slope(as.numeric(strsplit(k, ",")[[1]])
)$estimates[[1]])
}), by = var, .SDcols = xcols]
If you think about how these differ, you may find that, in a lot of ways, they aren't all that different. For example, in this last translation, you may see a similar approach in dplyr that I used.
df %>% group_by(var) %>%
mutate(across(all_of(x_cols),
~ifelse(var <= 5, 0, sens.slope(
as.numeric(unlist(
strsplit(., ','))))$estimates[[1]])))

Dplyr Mutate Syntax

girders <- mutate(materials.split[[girder.Option]], bridge.Q = girders.V, interventions = interventions.girders)
Hello people. I want to ask about mutate function in Dplyr for R. When I see the syntax of mutate, I generally see only two parts in the mutate function. But I have an example above. There are three parts. What does it exactly mean this line? What does the coder want to do there? For example, mutate creates new columns in a table. But what does it mean here girders <- mutate ? is "girders" the new name of the new column which is created? Could you explain this?
The number of parameters in dplyr functions may vary depending upon the context. The idea of the pipeline in dplyr is to pass the result of the previous function as the first (data) parameter to the next function (dplyr verb). So you can substitute the provided line of code with the following dplyr equivalent:
girders <- materials.split[[girder.Option]] %>%
mutate(bridge.Q = girders.V, interventions = interventions.girders)
materials.split[[girder.Option]] is set then after %>% passed as the first parameter of mutate. If you add another %>% operator the resulting dataset will be passed to the following verb and so on. In this case the function will not require setting the first dataset parameter.
<- is the assignment operator in R. x <- y means we are assigning the right-hand side of the operation (aka, 'y') to a new (or overriden) variable x . It has nothing to do with dplyr.
Try a few examples to understand it:
a <- 5+6
a
[1] 11
#or to illustrate with dplyr
library(dplyr)
tiny_iris <- iris %>%
select(1:2) %>%
slice(1:2)
tiny_iris
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0

mutate_at using function lag but keep first row

I am using dplyr in R (with great joy) and want to get the differential of the columns mpg to gear in mtcars. The first row then returns NA (for obvious reason). Instead of this first row being NA I would like it to stay the original value.
I am looking for a clean and efficient way to achieve this (not using join to add the first row to the differntiated values since the code on my own dataset contains many filters and grouped variables).
my code is as follows:
mtcars %>% mutate_at(vars(mpg:gear), funs(. - lag(., 1)))
I expect the first row to be mtcars[1] and the rest to be the differential
We can specify the default parameter with 0, otherwise, it would be NA
library(dplyr)
mtcars %>%
mutate_at(vars(mpg:gear), list(~ . - lag(., default = 0)))
Or another option is diff with concatenating the first element
mtcars %>%
mutate_at(vars(mpg:gear), list(~ c(first(.), diff(.))))
NOTE: The funs is getting deprecated. Instead of that we are using list

dplyr: Send all Columns to a function within a mutate following a group_by

What is the preferred way to send all columns within a current group to a function as a tibble or data.frame when calling an arbitrary function in a dplyr pipe?
In the example below, mean_B is a simple example where I know what is needed before I make the function call. mean_B_fun gives the wrong answer (compared to what I want-- I want the within-group mean), and mean_B_fun_ugly gives what I want, but it seems like both an inefficient (and ugly) way to get the effect I want.
The reason I want to operate on arbitrary columns is that in practice, I'm taking my_fun in the example below from the user, and I don't know the columns that the user will need to operate on a priori.
library(dplyr)
my_fun <- function(x) mean(x$B)
my_data <-
expand.grid(A=1:3, B=1:2) %>%
mutate(B=A*B) %>%
group_by(A) %>%
mutate(mean_B=mean(B),
mean_B_fun=my_fun(.),
mean_B_fun_ugly=my_fun(as.data.frame(.)[.$A == unique(A),,drop=FALSE]))
here's my answer, not knowing the columns on which you want to calculate the mean.
expand.grid(A=1:3, B=1:2) %>%
mutate(B=A*B) %>% nest(-A) %>%
mutate(means = map(.$data, function(x) colMeans(x)))
A data means
1 1 1, 2 1.5
2 2 2, 4 3
3 3 3, 6 4.5

pass grouped dataframe to own function in dplyr

I am trying to transfer from plyr to dplyr. However, I still can't seem to figure out how to call on own functions in a chained dplyr function.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr functions looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I though this should look something like this
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.
For those who get here from google. Let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
group_by(b) %>%
printFunction(.)
prints entire data. To get dplyr print multiple tables grouped by, you should use do
df %>%
group_by(b) %>%
do(printFunction(.))

Resources