I want to do something like this
df <- iris %>%
rowwise %>%
mutate(new_var = sum(Sepal.Length, Sepal.Width))
Except I want to do it without typing the variable names, e.g.
names_to_add <- c("Sepal.Length", "Sepal.Width")
df <- iris %>%
rowwise %>%
[some function that uses names_to_add]
I attempted a few things e.g.
df <- iris %>%
rowwise %>%
mutate(new_var = sum(sapply(names_to_add, get, envir = as.environment(.))))
but I still can't figure it out. I'll take an answer that plays around with lazyeval, or something simpler. Note that the sum function here is just a placeholder; my actual function is much more complex, although it returns one value per row. I'd also rather not use data.table.
You should check out all the dplyr functions that end with _, for example mutate_, summarise_, etc.
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df <- iris %>%
rowwise %>% mutate_(names_to_add)
Edit
The results of the code:
df <- iris %>%
rowwise %>% mutate(new_var = sum(Sepal.Length, Sepal.Width))
names_to_add <- ("sum(Sepal.Length, Sepal.Width)")
df2 <- iris %>%
rowwise %>% mutate_(new_var = names_to_add)
identical(df, df2)
[1] TRUE
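For what it's worth, the underscored verbs have been deprecated since dplyr 0.7.0. Below is a minimal sketch of the same string-based idea using tidy evaluation instead, assuming rlang is installed (it comes with dplyr). The names expr_string and df3 are only illustrative.

```r
library(dplyr)

# The expression to evaluate, stored as a string (illustrative name)
expr_string <- "sum(Sepal.Length, Sepal.Width)"

df3 <- iris %>%
  rowwise() %>%
  # parse_expr() turns the string into an expression; !! injects it into mutate()
  mutate(new_var = !!rlang::parse_expr(expr_string)) %>%
  ungroup()

head(df3)
```

Because the string is parsed as code, this should only be used with strings you control.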
Edit
I edited the answer and it now solves the problem. I wonder why it was downvoted. We use standard evaluation (SE), passing a string as input to mutate_. More info: vignette("nse","dplyr")
x <- "Sepal.Length + Sepal.Width"
df <- mutate_(iris, x)
head(df)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length + Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
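If the goal is to pass a character vector of column names rather than a whole expression, one possible approach (a sketch, assuming dplyr 1.0.0 or later) is c_across() with all_of(). Here sum is still just the placeholder from the question; any function that takes a vector and returns one value should fit in its place.

```r
library(dplyr)

names_to_add <- c("Sepal.Length", "Sepal.Width")

df <- iris %>%
  rowwise() %>%
  # c_across() collects the selected columns of the current row into one vector
  mutate(new_var = sum(c_across(all_of(names_to_add)))) %>%
  ungroup()

head(df)
```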
I have a column of numbers that I want to change from a count to a percentage.
This code works:
df <- df %>%
select(casualty_veh_ref, JourneyPurpose ) %>%
group_by(JourneyPurpose) %>%
summarise(Number=n()) %>%
mutate(Percentage=Number/sum(Number)*100)
df$Percentage <- paste(round(df$Percentage), "%", sep="")
But if I try to keep the piping using percent_format from the scales package:
df <- df %>%
select(casualty_veh_ref, JourneyPurpose ) %>%
group_by(JourneyPurpose) %>%
summarise(Number=n()) %>%
mutate(Percentage=Number/sum(Number)) %>%
percent_format(Percentage, suffix = "%")
I receive the error message
Error in force_all(accuracy, scale, prefix, suffix, big.mark, decimal.mark, :
object 'Percentage' not found
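I don't understand why the object is not found.
Try this (I've used iris for illustration). percent_format() does not format a column of a data frame; it returns a formatting function, so when the data frame is piped into it, Percentage is looked up outside the data and is not found. Use scales::percent() inside mutate() instead: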
library(dplyr)
iris %>%
slice(1:4) %>%
mutate(Test=Sepal.Length/45,Test=scales::percent(Test))
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test
1 5.1 3.5 1.4 0.2 setosa 11.33%
2 4.9 3.0 1.4 0.2 setosa 10.89%
3 4.7 3.2 1.3 0.2 setosa 10.44%
4 4.6 3.1 1.5 0.2 setosa 10.22%
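To stay closer to the original count-to-percentage pipeline, here is a small sketch using iris in place of the asker's data. It assumes the scales package is installed; the names fmt and out are illustrative. percent_format() is called once to build a formatter, and that formatter is then applied inside mutate().

```r
library(dplyr)

# percent_format() returns a function that formats numbers as percentages
fmt <- scales::percent_format(accuracy = 1)

out <- iris %>%
  group_by(Species) %>%
  summarise(Number = n()) %>%
  # the formatter is applied to the column inside mutate(), not piped into
  mutate(Percentage = fmt(Number / sum(Number)))

out
```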
It seems like dplyr::pull() and dplyr::select() do the same thing. Is there a difference besides that dplyr::pull() only selects 1 variable?
First, it makes sense to see what class each function creates.
library(dplyr)
mtcars %>% pull(cyl) %>% class()
#> [1] "numeric"
mtcars %>% select(cyl) %>% class()
#> [1] "data.frame"
So pull() creates a vector -- which, in this case, is numeric -- whereas select() creates a data frame.
Basically, pull() is equivalent to writing mtcars$cyl or mtcars[, "cyl"], whereas select() removes all of the columns except cyl but keeps the data frame structure.
You could see select as an analogue of [ or magrittr::extract, and pull as an analogue of [[ (or $) or magrittr::extract2 for data frames (an analogue of [[ for lists would be purrr::pluck).
df <- iris %>% head
All of these give the same output:
df %>% pull(Sepal.Length)
df %>% pull("Sepal.Length")
a <- "Sepal.Length"; df %>% pull(!!quo(a))
df %>% extract2("Sepal.Length")
df %>% `[[`("Sepal.Length")
df[["Sepal.Length"]]
# all of them:
# [1] 5.1 4.9 4.7 4.6 5.0 5.4
And all of these give the same output:
df %>% select(Sepal.Length)
a <- "Sepal.Length"; df %>% select(!!quo(a))
df %>% select("Sepal.Length")
df %>% extract("Sepal.Length")
df %>% `[`("Sepal.Length")
df["Sepal.Length"]
# all of them:
# Sepal.Length
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4
pull and select accept bare (unquoted) names, character strings, or numeric indices, while the others accept only character or numeric indices.
One important difference is how they handle negative indices.
For select, negative indices mean columns to drop.
For pull, they count back from the last column.
df %>% pull(-Sepal.Length)
df %>% pull(-1)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
A strange result, but Sepal.Length is converted to its column position, 1, and column -1 is Species (the last column).
This feature is not supported by [[ and extract2:
df %>% `[[`(-1)
df %>% extract2(-1)
df[[-1]]
# Error in .subset2(x, i, exact = exact) :
# attempt to select more than one element in get1index <real>
Negative indices to drop columns are supported by [ and extract though.
df %>% select(-Sepal.Length)
df %>% select(-1)
df %>% `[`(-1)
df[-1]
# Sepal.Width Petal.Length Petal.Width Species
# 1 3.5 1.4 0.2 setosa
# 2 3.0 1.4 0.2 setosa
# 3 3.2 1.3 0.2 setosa
# 4 3.1 1.5 0.2 setosa
# 5 3.6 1.4 0.2 setosa
# 6 3.9 1.7 0.4 setosa
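As a quick way to check the vector-versus-data-frame distinction yourself, here is a small sketch that takes the column name from a variable. It assumes rlang is installed (it comes with dplyr); the names v, d, and last_col are illustrative.

```r
library(dplyr)

df <- head(iris)
a <- "Sepal.Length"

v <- df %>% pull(!!rlang::sym(a))  # a plain numeric vector
d <- df %>% select(all_of(a))      # a one-column data frame

is.numeric(v)
is.data.frame(d)
identical(v, df[[a]])

# A negative index in pull() counts from the end, so -1 is the last column
last_col <- df %>% pull(-1)
identical(last_col, df$Species)
```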
I have a problem that I can replicate using the iris dataset, where there are many groups of variables (sharing the same name prefix) with two different suffixes. I want to take a ratio within each of these groups but can't find a tidyverse solution. I would have thought mutate_at() might have been able to help.
In the iris dataset, for the Petal columns I want to generate a Petal ratio of Length / Width, and I want to do the same for Sepal. I don't want to do this manually in a mutate() because I have lots of variable groups, and they could change over time.
I do have a solution that works using base R (in the code below) but I wanted to know if there was a tidyverse solution that achieved the same.
# libs ----
library(tidyverse)
# data ----
df <- iris
glimpse(df)
# set up column vectors ----
length_cols <- names(df) %>% str_subset("Length") %>% sort()
width_cols <- names(df) %>% str_subset("Width") %>% sort()
new_col_names <- names(df) %>% str_subset("Length") %>% str_replace("\\.Length", ".Ratio") %>% sort()
length_cols
width_cols
new_col_names
# make new cols ----
df[, new_col_names] <- df[, length_cols] / df[, width_cols]
df %>% head()
Thanks,
Gareth
Here is one possibility using purrr::map:
library(tidyverse)
df <- map(c("Petal", "Sepal"), ~ iris %>%
mutate(
!!paste0(.x, ".Ratio") := !!as.name(paste0(.x, ".Length")) / !!as.name(paste0(.x, ".Width")))) %>%
reduce(left_join)
head(df)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
#1 5.1 3.5 1.4 0.2 setosa 7.00
#2 4.9 3.0 1.4 0.2 setosa 7.00
#3 4.7 3.2 1.3 0.2 setosa 6.50
#4 4.6 3.1 1.5 0.2 setosa 7.50
#5 5.0 3.6 1.4 0.2 setosa 7.00
#6 5.4 3.9 1.7 0.4 setosa 4.25
# Sepal.Ratio
#1 1.457143
#2 1.633333
#3 1.468750
#4 1.483871
#5 1.388889
#6 1.384615
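Explanation: We map over the prefixes "Petal" and "Sepal"; for each prefix we take the iris columns with the suffixes "Length" and "Width" and compute a new prefix + ".Ratio" column. reduce(left_join) then merges the resulting data.frames, joining on all the columns they share. Note that exact duplicate rows in the input (iris has one) are multiplied by such a join.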
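An alternative sketch that avoids the join altogether: purrr::reduce() can start from iris itself (via .init) and add one ratio column per prefix, so the row count never changes. The names prefixes, acc, and prefix are illustrative, and the .data pronoun is used here in place of !!as.name().

```r
library(dplyr)
library(purrr)

prefixes <- c("Petal", "Sepal")

df <- reduce(
  prefixes,
  function(acc, prefix) {
    # acc is the data frame built so far; add one <prefix>.Ratio column to it
    mutate(
      acc,
      !!paste0(prefix, ".Ratio") :=
        .data[[paste0(prefix, ".Length")]] / .data[[paste0(prefix, ".Width")]]
    )
  },
  .init = iris
)

head(df)
```

Adding another variable group then only requires adding its prefix to the vector.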
How can I use variables in place of column names in dplyr strings? As an example say I want to add a column to the iris dataset called sum that is the sum of Sepal.Length and Sepal.Width. In short I want a working version of the below code.
x = "Sepal.Length"
y = "Sepal.Width"
head(iris%>% mutate(sum = x+y))
Currently, running the code outputs "Evaluation error: non-numeric argument to binary operator" as R evaluates x and y as character vectors. How do I instead get R to evaluate x and y as column names of the dataframe? I know that the answer is to use some form of lazy evaluation, but I'm having trouble figuring out exactly how to configure it.
Note that the proposed duplicate: dplyr - mutate: use dynamic variable names does not address this issue. The duplicate answers this question:
Not my question: How do I do:
var = "sum"
head(iris %>% mutate(var = Sepal.Length + Sepal.Width))
I think the recommended way is to use sym():
iris %>% mutate(sum = !!sym(x) + !!sym(y)) %>% head
It also works with get():
rm(list = ls())
data("iris")
library(dplyr)
x <- "Sepal.Length"
y <- "Sepal.Width"
head(iris %>% mutate(sum = get(x) + get(y)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
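Another option, sketched here on the assumption of a reasonably recent dplyr, is the .data pronoun, which looks a column up by its name as a string. It is handy inside functions; add_sum below is a hypothetical helper, not part of dplyr.

```r
library(dplyr)

x <- "Sepal.Length"
y <- "Sepal.Width"

# Hypothetical helper: adds a `sum` column from two columns given as strings
add_sum <- function(data, col1, col2) {
  data %>% mutate(sum = .data[[col1]] + .data[[col2]])
}

res <- add_sum(iris, x, y)
head(res)
```

Unlike get(), .data[[x]] only ever looks inside the data frame, so a variable of the same name in the calling environment cannot be picked up by accident.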
I want to rename a specific column in dplyr, using a new name that is stored in a variable.
newName = paste0('nameY', 2017)
What I tried was
iris %>%
rename(newName = Petal.Length) %>%
head(2)
Which gives
Sepal.Length Sepal.Width newName Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
I am getting newName, not nameY2017, which is expected. So I tried
iris %>%
rename_(eval(newName) = 'Petal.Length')
But then I am getting an error.
Error: unexpected '=' in "iris %>% rename_(eval(newName) ="
Is there a proper way to do it with dplyr?
I know I can do something like
names(iris)[3] <- newName
But that wouldn't be a dplyr solution.
Credit and further information: see the post "dplyr 'rename' standard evaluation function not working as expected?"
Your code:
newName = paste0('nameY', 2017)
iris %>%
rename(newName = Petal.Length) %>%
head(2)
Solution:
iris %>%
rename_(.dots = setNames("Petal.Length",newName)) %>%
head(2)
Output:
Sepal.Length Sepal.Width nameY2017 Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
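Since rename_() has been deprecated since dplyr 0.7.0, here is a sketch of the tidy-evaluation equivalent: unquote the variable on the left-hand side with !! and use := instead of =. The name renamed is illustrative.

```r
library(dplyr)

newName <- paste0("nameY", 2017)

renamed <- iris %>%
  # !! injects the string held in newName; := is needed because the
  # left-hand side is computed rather than written literally
  rename(!!newName := Petal.Length)

names(head(renamed, 2))
```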