girders <- mutate(materials.split[[girder.Option]], bridge.Q = girders.V, interventions = interventions.girders)
Hello everyone. I want to ask about the mutate function in dplyr for R. When I look at examples of mutate, I generally see only two arguments, but the line above has three. What exactly does this line mean, and what is the coder trying to do? I know that mutate creates new columns in a table, but what does girders <- mutate mean here? Is "girders" the name of the new column that is created? Could you explain this?
The number of parameters in dplyr functions may vary depending upon the context. The idea of the pipeline in dplyr is to pass the result of the previous function as the first (data) parameter to the next function (dplyr verb). So you can substitute the provided line of code with the following dplyr equivalent:
girders <- materials.split[[girder.Option]] %>%
mutate(bridge.Q = girders.V, interventions = interventions.girders)
materials.split[[girder.Option]] is evaluated first, and %>% then passes it as the first parameter of mutate. If you add another %>% operator, the resulting dataset is passed to the following verb, and so on. Inside a pipeline, then, you do not need to supply the data parameter yourself.
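A minimal sketch of that equivalence, assuming small made-up data: girder_table stands in for materials.split[[girder.Option]], and its columns and the values in girders.V are hypothetical, since the original objects are not shown in the question.

```r
library(dplyr)

# Hypothetical stand-ins for the objects in the question
girder_table <- data.frame(material = c("steel", "concrete"),
                           unit.cost = c(120, 80))
girders.V <- c(10, 25)

# Nested form: the data frame is the first argument of mutate
nested <- mutate(girder_table, bridge.Q = girders.V, total = unit.cost * bridge.Q)

# Piped form: %>% supplies the data frame as the first argument
piped <- girder_table %>%
  mutate(bridge.Q = girders.V, total = unit.cost * bridge.Q)

# Both forms build the same data frame; the names on the left of "=" are the new columns
identical(nested, piped)
```

Either way, the name on the left of <- (nested or piped here, girders in the question) is the variable that holds the whole resulting data frame, not a column.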
<- is the assignment operator in R. x <- y means we are assigning the right-hand side of the operation (i.e. y) to a new (or overwritten) variable x. It has nothing to do with dplyr.
Try a few examples to understand it:
a <- 5+6
a
[1] 11
#or to illustrate with dplyr
library(dplyr)
tiny_iris <- iris %>%
select(1:2) %>%
slice(1:2)
tiny_iris
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.
Consider the following dataset.
library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
mutate(g2 = ifelse(gear == 2, 1, 0),
g3 = ifelse(gear == 3, 1, 0),
g4 = ifelse(gear == 4, 1, 0))
Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:
answer = mtcars %>%
summarise(sum_g2 = sum(g2),
sum_g3 = sum(g3),
sum_g4 = sum(g4))
But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer to my previous question, I would use:
columns_to_sum = c("g2","g3","g4")
formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")
answer = mtcars %>%
summarise(!!!parse_exprs(formulas))
If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.
Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).
In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired, as the rlang approach is now the intended one. The related questions I reviewed on Stack Overflow did not use the rlang quotation approach and hence are not sufficient for my purposes.
How do I summarise with a dynamic number of columns (using an rlang approach)?
One option since dplyr 1.0.0 could be:
mtcars %>%
summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
sum_g2 sum_g3 sum_g4
1 0 15 12
Your attempt gives the correct values but does not give the column names you expect.
Here's an approach using map to get the names correct :
library(dplyr)
library(rlang)
library(purrr)
map_dfc(columns_to_sum, ~mtcars %>%
summarise(!!paste0('sum_', .x) := sum(!!sym(.x))))
# sum_g2 sum_g3 sum_g4
#1 0 15 12
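For a version that stays close to the !!! idea from the question, here is a sketch that builds the expressions from symbols and names the list before splicing. The likely reason the original attempt misnamed its columns is that each string "sum_g2 = sum(g2)" is parsed as a single unnamed expression, so summarise has no argument name to use. The names mtcars2 and sum_exprs are introduced only for this illustration.

```r
library(dplyr)
library(rlang)

# Same indicator columns as in the question, kept in a separate copy
mtcars2 <- mtcars %>%
  mutate(g2 = ifelse(gear == 2, 1, 0),
         g3 = ifelse(gear == 3, 1, 0),
         g4 = ifelse(gear == 4, 1, 0))

columns_to_sum <- c("g2", "g3", "g4")

# Build one unevaluated call sum(<column>) per column name
sum_exprs <- lapply(syms(columns_to_sum), function(col) expr(sum(!!col)))

# The list names become the output column names once spliced with !!!
names(sum_exprs) <- paste0("sum_", columns_to_sum)

answer <- mtcars2 %>%
  summarise(!!!sum_exprs)
answer
```

The same named list can be spliced into mutate, which then adds one column per expression instead of collapsing the rows.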
You can also use this simple base R approach without any NSE-stuff :
setNames(data.frame(t(colSums(mtcars[columns_to_sum]))),
paste0('sum_', columns_to_sum))
and same in dplyr way :
mtcars %>%
summarise(across(all_of(columns_to_sum), sum)) %>%
set_names(paste0('sum_', columns_to_sum))
I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as df %>%, but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
pull(height) %>%
mean(na.rm=TRUE) %>%
print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr; pull and summarise come from dplyr. Functions such as mean, sd, min and max return NA if the input contains any NA, so na.rm=T is needed whenever the column may contain missing values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested, you may not even need to use pull:
df %>% summarise(mean = mean(height,na.rm=T))
Also, with summarise you can store the results in another data frame rather than just printing them, and extract them from that data frame whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]
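To see where na.rm belongs and what it changes, here is a small sketch with a made-up column that contains one missing value. The data frame df_na and the result names are hypothetical.

```r
library(dplyr)

# Hypothetical data with one missing height
df_na <- data.frame(height = c(170, NA, 180))

# Without na.rm the missing value propagates and the mean is NA
mean_default <- df_na %>% pull(height) %>% mean()

# na.rm is an argument of mean(), so it goes inside mean()
mean_no_na <- df_na %>% pull(height) %>% mean(na.rm = TRUE)

mean_default
mean_no_na
```

The pipe only fills in the first argument of each call, so any other argument, such as na.rm, stays with the function it belongs to.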
I have an imported data frame that has column names with various punctuations including parentheses, e.g. BILLNG.STATUS.(COMPLETED./.INCOMPLTE) .
I was trying to use group_by from dplyr to do some summarizing, something like
df <- df %>% group_by(ORDER.NO, BILLNG.STATUS.(COMPLETED./.INCOMPLTE))
which brings the error Error in mutate_impl(.data, dots) :
could not find function "BILLNG.STATUS."
Short of changing the column names, is there a way to handle such column names directly in group_by ?
I think you can make this work if you enclose the "illegal" column names in backticks. For example, let's say I start with this data frame (called dat):
BILLING.STATUS.(COMPLETED./.INCOMPLETE) ORDER.VALUE.(USD)
1 A 0.01544196
2 A 0.95522706
3 B 1.13479303
4 B 1.22848285
Then I can summarise it like this:
dat %>% group_by(`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`) %>%
summarise(count=n(),
mean = mean(`ORDER.VALUE.(USD)`))
Giving:
BILLING.STATUS.(COMPLETED./.INCOMPLETE) count mean
1 A 2 0.4853345
2 B 2 1.1816379
Backticks also come in handy for referring to or creating variable names with whitespace. You can find a number of questions related to dplyr and backticks on SO, and there's also some discussion of backticks in the help for Quotes.
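A short sketch of the whitespace case, assuming a made-up data frame. The names orders and summary_tbl and the column names are hypothetical; check.names = FALSE is what stops data.frame from rewriting the names into syntactic ones.

```r
library(dplyr)

# Hypothetical data; check.names = FALSE keeps the non-syntactic names as typed
orders <- data.frame(`billing status` = c("A", "A", "B", "B"),
                     `order value (USD)` = c(1, 3, 10, 20),
                     check.names = FALSE)

# Backticks let the names be used like ordinary column names
summary_tbl <- orders %>%
  group_by(`billing status`) %>%
  summarise(count = n(),
            mean_value = mean(`order value (USD)`))
summary_tbl
```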
I'm just using this not-an-answer as a counter-example or illustration of the limitations of the backtick method. (It was the first stratagem I tried. Perhaps it is the fact that two language operations ("(" and "/") are being handled adjacently that makes this fail.)
names(iris)[5] <- "Specie(/)s"
library(dplyr)
by_species <- iris %>% group_by(`Specie(/)s`)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))
#Error: cannot modify grouping variable
I tried a variety of other language-oriented efforts with quote, as.name and substitute that also failed. (I wish there were a mechanism to request that this sink to the bottom of the answers.)
I noticed that the order in which dplyr functions are used in a pipeline affects the result. For example:
iris %>%
group_by(Species) %>%
mutate(Sum = sum(Sepal.Length))
produces different results than this:
iris %>%
mutate(Sum = sum(Sepal.Length)) %>%
group_by(Species)
Can anyone explain the reason for this? If there is a specific order in which they have to be written, please mention it.
Thank you
FYI: iris is a built-in dataset in R; use data(iris) to load it. I was trying to add a new column holding the sum of sepal lengths for each species.
Yes, the order matters.
The pipe is equivalent to:
iris<-group_by(iris, Species)
iris<-mutate(iris, Sum = sum(Sepal.Length))
If you change the order, you change the result. If you group by species first, you'll have the result of the sum by species (I guess that's what you want).
However if you group by species after the sum, this sum will correspond to summing the Sepal length for all species.
Yes, the order matters because each step of the pipe is evaluated on its own, from first to last, and the result of each step (or the original dataset, for the first step) is passed forward to the next one. That means that if you use group_by after mutate, as in your second example, the mutate is done without grouping.
One side effect is that you can create complex and long pipes where you control the order of operations (by positioning them at the right part of the pipe) and you don't need to start a new pipe after an operation is finished.
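The difference can be checked directly by counting the distinct values that end up in the Sum column. This is a sketch using the untouched built-in dataset via datasets::iris; the names grouped_first and grouped_last are introduced only for the comparison.

```r
library(dplyr)

# group_by first: sum() is computed once per species
grouped_first <- datasets::iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  ungroup()

# mutate first: sum() is computed over all rows, grouping is only applied afterwards
grouped_last <- datasets::iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species) %>%
  ungroup()

n_distinct(grouped_first$Sum)  # one value per species
n_distinct(grouped_last$Sum)   # a single overall value
```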